
I have a huge newline-delimited JSON file, input.json, which looks like this:

{ "name":"a.txt", "content":"...", "other_keys":"..."}
{ "name":"b.txt", "content":"...", "something_else":"..."}
{ "name":"c.txt", "content":"...", "etc":"..."}
...

How can I split it into multiple text files, where file names are taken from "name" and file content is taken from "content"? Other keys can be ignored. I'm currently toying with the jq tool, without luck.

  • jq can collect the objects with the same name and content, but it doesn't have the ability to open and write to arbitrary files. Commented Dec 20, 2019 at 16:20

3 Answers


The key to an efficient, jq-based solution is to pipe the output of jq (invoked with the -c option) to a program such as awk to perform the actual writing of the output files.

jq -c '.name, .content' input.json |
  awk '
    # fn is set: this line is the content; write it and reset for the next pair
    fn { print > fn; close(fn); fn = ""; next }
    # fn is empty: this line is the (JSON-quoted) file name; strip the quotes
    { fn = $0; sub(/^"/, "", fn); sub(/"$/, "", fn) }'

Warnings

Blindly relying on the JSON input for the file names has some risks, e.g.

  • what if the same "name" is specified more than once?
  • if a file already exists, the above program will overwrite it, since awk's ">" redirection truncates the file when it is opened (use ">>" to append instead);
  • the content is written exactly as jq emits it, i.e. as a JSON string literal (quoted, with backslash escapes intact), not as decoded raw text.

Also, somewhere along the line, the validity of .name as a filename should be checked.
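One minimal validity check, sketched here in Python (the safe_name helper is hypothetical, not part of the answer above), strips directory components and rejects empty or dot-prefixed results:

```python
import os.path

def safe_name(name: str) -> str:
    """Reduce an untrusted "name" value to a bare file name.

    Hypothetical helper: keeps only the base name (so "../../etc/passwd"
    becomes "passwd") and rejects names that are empty or start with ".".
    """
    base = os.path.basename(name)
    if not base or base.startswith("."):
        raise ValueError(f"unsafe file name: {name!r}")
    return base
```

Any name coming out of the JSON would then pass through safe_name before being handed to open().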

Related answers on SO

This question has been asked and answered on SO in slightly different forms before, see e.g. Split a JSON file into separate files




jq doesn't have the output capabilities to write the desired files itself; you'll need to use another language with a JSON library. An example using Python:

import json
import fileinput

for line in fileinput.input():  # Read from standard input or filename arguments
    d = json.loads(line)
    with open(d['name'], "a") as f:
        print(d['content'], file=f)

This has the drawback of reopening and closing each file once per input line, but it's simple. A more complex, but more efficient, version uses an exit-stack context manager:

import json
import fileinput
import contextlib

with contextlib.ExitStack() as es:
    files = {}
    for line in fileinput.input():
        d = json.loads(line)
        file_name = d['name']
        if file_name not in files:
            files[file_name] = es.enter_context(open(file_name, "w"))
        print(d['content'], file=files[file_name])

Put briefly, files are opened and cached as they are discovered. Once the loop completes (or in the event of an exception), the exit stack ensures all files previously opened are properly closed.

If there's a chance that there will be too many files to have open simultaneously, you'll have to fall back on the simple-but-inefficient code. Alternatively, you could implement something more complex that keeps only a small, fixed number of files open at any given time, reopening them in append mode as necessary; implementing that is beyond the scope of this answer, though.
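Should you want to attempt it anyway, here is a minimal sketch of that idea, assuming an LRU eviction policy; the split_ndjson helper and the cap of 100 handles are illustrative choices, not part of the answer above:

```python
import json
from collections import OrderedDict

MAX_OPEN = 100  # hypothetical cap on simultaneously open file handles

def split_ndjson(lines, max_open=MAX_OPEN):
    """Write each record's "content" to the file named by "name",
    keeping at most max_open handles open (least-recently-used eviction)."""
    files = OrderedDict()  # name -> handle, least recently used first
    try:
        for line in lines:
            d = json.loads(line)
            name = d["name"]
            if name in files:
                files.move_to_end(name)  # mark as most recently used
            else:
                if len(files) >= max_open:
                    _, oldest = files.popitem(last=False)
                    oldest.close()
                # "a": a file that was evicted and reopened continues
                # where it left off instead of being truncated
                files[name] = open(name, "a")
            print(d["content"], file=files[name])
    finally:
        for f in files.values():
            f.close()
```

As with the simple version, a file left over from an earlier run is appended to rather than replaced, so the same caveats apply.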



The following jq-based solution ensures that the output in the files is pretty-printed, while ignoring any input object whose .content equals the JSON string "IGNORE ME" (the same string also serves as the record separator the awk script keys on):

jq 'if .content == "IGNORE ME" 
    then "Skipping IGNORE ME" | stderr | empty
    else .name, .content, "IGNORE ME" end' input.json |
    awk '/^"IGNORE ME"$/ {close(fn); fn=""; next}
         fn {print >> fn; next}
         {fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn);}'

