
I have JSON exported from Cassandra in this format.

[
  {
    "correlationId": "2232845a8556cd3219e46ab8",
    "leg": 0,
    "tag": "received",
    "offset": 263128,
    "len": 30,
    "prev": {
      "page": {
        "file": 0,
        "page": 0
      },
      "record": 0
    },
    "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
  },
  {
    "correlationId": "2232845a8556cd3219e46ab8",
    "leg": 0,
    "tag": "sent",
    "offset": 262971,
    "len": 157,
    "prev": {
      "page": {
        "file": 10330,
        "page": 6
      },
      "record": 1271
    },
    "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
  }]

I would like to split it into separate documents:

{ "correlationId": "2232845a8556cd3219e46ab8", "leg": 0, "tag": "received", "offset": 263128, "len": 30, "prev": { "page": { "file": 0, "page": 0 }, "record": 0 }, "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n" }

and

{ "correlationId": "2232845a8556cd3219e46ab8", "leg": 0, "tag": "sent", "offset": 262971, "len": 157, "prev": { "page": { "file": 10330, "page": 6 }, "record": 1271 }, "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n" }

I wanted to use jq but couldn't find a way to do it.

Can you please advise how to split it into separate documents?

Thanks, Reddy

2 Comments
  • Possible duplicate of Split a JSON file into separate files. Commented Feb 14, 2018 at 16:11
  • Do you need it to work for an arbitrary number of documents, or specifically for two documents? Commented Feb 14, 2018 at 16:15

6 Answers

9

To split a JSON array with many records into chunks of a desired size, I simply use:

jq -c '.[0:1000]' mybig.json

which works like Python slicing.

See the docs here: https://stedolan.github.io/jq/manual/

Array/String Slice: .[10:15]

The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
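The same chunking can be done in Python; here is a minimal sketch (the input data, chunk size, and output file names are made-up assumptions):

```python
import json

def chunked(docs, size):
    """Slice a list into consecutive chunks of at most `size` elements."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]

# Stand-in data; in practice you would json.load() the exported file.
docs = [{"leg": i} for i in range(5)]

for n, chunk in enumerate(chunked(docs, 2)):
    # Each chunk is itself a JSON array, like jq's .[0:1000] output.
    with open("chunk{}.json".format(n), "w") as out:
        json.dump(chunk, out)
```

Like jq's slice, the last chunk may be shorter than the requested size.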


8

Using jq, one can split an array into its components using the filter:

.[]

The question then becomes what is to be done with each component. If you want to direct each component to a separate file, you could (for example) run jq with the -c option and pipe the result into awk, which can then allocate the components to different files. See e.g. Split JSON File Objects Into Multiple Files.

Performance considerations

One might think that the overhead of invoking both jq and awk would be high compared to a single Python process, but jq and awk are both lightweight compared to Python with its json module, as suggested by these timings (using Python 2.7.10):

time (jq -c  .[] input.json | awk '{print > "doc00" NR ".json";}')
user    0m0.005s
sys     0m0.008s

time python split.py
user    0m0.016s
sys     0m0.046s


4

You can do it more efficiently using Python (because you can read the entire input once, instead of once per document):

import json

# Read the whole array once, then write each element to its own file.
with open('in.json') as f:
    docs = json.load(f)

for ii, doc in enumerate(docs):
    with open('doc{}.json'.format(ii), 'w') as out:
        json.dump(doc, out, indent=2)

2 Comments

@JohnZwinck - If you mean more efficient than multiple invocations of jq, perhaps you could say so, though for N=2, my timings indicate that for the specific data provided by the OP, your python solution is overall more than five times slower than the twice-jq solution.
@peak: Nobody cares about the performance of a degenerate case like N=2. Can you try it with N=10000, please?
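For anyone who wants to rerun the comparison at larger N, a synthetic input can be generated along these lines (a sketch; the in.json filename matches the Python answer above, and the field values are made up):

```python
import json

# Generate a synthetic array of N small documents for benchmarking;
# the field names mirror the OP's records, the values are invented.
N = 10000
docs = [{"correlationId": "id-%024x" % i, "leg": 0, "tag": "received"}
        for i in range(N)]

with open("in.json", "w") as f:
    json.dump(docs, f)
```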
2

If you have an array of exactly two objects:

jq '.[0]' input.json > doc1.json && jq '.[1]' input.json > doc2.json

Results:

$ head -n100 doc[12].json
==> doc1.json <==
{
  "correlationId": "2232845a8556cd3219e46ab8",
  "leg": 0,
  "tag": "received",
  "offset": 263128,
  "len": 30,
  "prev": {
    "page": {
      "file": 0,
      "page": 0
    },
    "record": 0
  },
  "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
}

==> doc2.json <==
{
  "correlationId": "2232845a8556cd3219e46ab8",
  "leg": 0,
  "tag": "sent",
  "offset": 262971,
  "len": 157,
  "prev": {
    "page": {
      "file": 10330,
      "page": 6
    },
    "record": 1271
  },
  "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
}

2 Comments

You should probably use jq '. | length' input.json to get the number of documents, then loop that many times.
@JohnZwinck, before making such statements you should have clarified that point with the OP, if you consider it critical. In any case, I don't think such behavior reflects well on a 126K-rep contributor. I'm disappointed.
1

One way to do this is to use jq's streaming option and pipe the output to the split command:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))' bigfile.json | split -l $num_of_elements_in_a_file - big_part

Each object ends up on its own line (because of -c), so split -l effectively splits by object count: the number of objects per output file is controlled by the value you put in num_of_elements_in_a_file.

See the answer to "Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?", which in turn refers to the jq FAQ for a discussion of the streaming parser: https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser


1

Just adding another example:

jq -c '.[0:10]' large_json.json > outputtosmall.json

1 Comment

You can format your code using the ` character.
