We have an Amazon S3 bucket that contains around a million JSON files, each around 500 KB compressed. The files are written by AWS Kinesis Firehose, with a new one arriving every 5 minutes. They all describe similar events, so they are logically the same, and they are all valid JSON, but their structures and hierarchies differ. Their formatting and line endings are also inconsistent: some objects sit on a single line, some span many lines, and sometimes the end of one object is on the same line as the start of the next.
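
To make the layout problem concrete, here is a minimal sketch of one way to pull individual objects out of such a file without relying on line boundaries. It assumes the file has already been downloaded, decompressed, and decoded into a string; the function name `iter_json_objects` and the sample data are hypothetical and only for illustration. It uses `json.JSONDecoder.raw_decode` from the Python standard library, which parses one JSON value starting at a given offset and reports where that value ended.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found in a string of concatenated values.

    Hypothetical helper: the values may be separated by newlines, other
    whitespace, or nothing at all (e.g. '}{' on the same line).
    """
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not accept leading whitespace, so skip it manually.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Parse exactly one value and get the offset just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up sample mixing the three layouts described above: an object on one
# line, an object spread over several lines, and two objects sharing a line.
sample = '{"a": 1}\n{"b": {\n  "c": 2\n}}{"d": 3}\n'
objects = list(iter_json_objects(sample))
```

Because the decoder tracks its own position, line endings stop mattering. The trade-off is that the whole decompressed file is held in memory as one string, which should be acceptable for files of roughly this size but is an assumption worth checking against the real data.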