
I have a huge newline-delimited JSON file, input.json, which looks like this:

{ "name":"a.txt", "content":"...", "other_keys":"..."}
{ "name":"b.txt", "content":"...", "something_else":"..."}
{ "name":"c.txt", "content":"...", "etc":"..."}
...

How can I split it into multiple text files, where file names are taken from "name" and file content is taken from "content"? Other keys can be ignored. I'm currently toying with the jq tool, without luck.

  • jq can collect the objects with the same name and content, but it doesn't have the ability to open and write to arbitrary files. Commented Dec 20, 2019 at 16:20

3 Answers


The key to an efficient, jq-based solution is to pipe the output of jq (invoked with the -c option) to a program such as awk to perform the actual writing of the output files.

jq -c '.name, .content' input.json |
  awk '
    # fn is set: this line is the content; write it and reset for the next pair
    fn { print > fn; close(fn); fn = ""; next }
    # fn is empty: this line is the (JSON-quoted) file name; strip the quotes
    { fn = $0; sub(/^"/, "", fn); sub(/"$/, "", fn) }'

Warnings

Blindly relying on the JSON input for the file names has some risks, e.g.

  • what if the same "name" is specified more than once?
  • if a file already exists, the above program will overwrite it, since awk's ">" redirection truncates the file when it is opened (use ">>" to append instead);
  • the content is written exactly as jq emits it, i.e. as a JSON string literal (quoted, with backslash escapes intact), not as decoded raw text.

Also, somewhere along the line, the validity of .name as a filename should be checked.
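One minimal validity check, sketched here in Python (the safe_name helper is hypothetical, not part of the answer above), strips directory components and rejects empty or dot-prefixed results:

```python
import os.path

def safe_name(name: str) -> str:
    """Reduce an untrusted "name" value to a bare file name.

    Hypothetical helper: keeps only the base name (so "../../etc/passwd"
    becomes "passwd") and rejects names that are empty or start with ".".
    """
    base = os.path.basename(name)
    if not base or base.startswith("."):
        raise ValueError(f"unsafe file name: {name!r}")
    return base
```

Any name coming out of the JSON would then pass through safe_name before being handed to open().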

Related answers on SO

This question has been asked and answered on SO in slightly different forms before, see e.g. Split a JSON file into separate files




jq doesn't have the output capabilities to write the desired files itself; you'll need to use another language with a JSON library. An example using Python:

import json
import fileinput

for line in fileinput.input():  # Read from standard input or filename arguments
    d = json.loads(line)
    with open(d['name'], "a") as f:
        print(d['content'], file=f)

This has the drawback of reopening and closing each file once per input line, but it's simple. A more complex, but more efficient, version uses an exit-stack context manager:

import json
import fileinput
import contextlib

with contextlib.ExitStack() as es:
    files = {}
    for line in fileinput.input():
        d = json.loads(line)
        file_name = d['name']
        if file_name not in files:
            files[file_name] = es.enter_context(open(file_name, "w"))
        print(d['content'], file=files[file_name])

Put briefly, files are opened and cached as they are discovered. Once the loop completes (or in the event of an exception), the exit stack ensures all files previously opened are properly closed.

If there's a chance that there will be too many files to have open simultaneously, you'll have to fall back on the simple-but-inefficient code. Alternatively, you could implement something more complex that keeps only a small, fixed number of files open at any given time, reopening them in append mode as necessary; implementing that is beyond the scope of this answer, though.
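Should you want to attempt it anyway, here is a minimal sketch of that idea, assuming an LRU eviction policy; the split_ndjson helper and the cap of 100 handles are illustrative choices, not part of the answer above:

```python
import json
from collections import OrderedDict

MAX_OPEN = 100  # hypothetical cap on simultaneously open file handles

def split_ndjson(lines, max_open=MAX_OPEN):
    """Write each record's "content" to the file named by "name",
    keeping at most max_open handles open (least-recently-used eviction)."""
    files = OrderedDict()  # name -> handle, least recently used first
    try:
        for line in lines:
            d = json.loads(line)
            name = d["name"]
            if name in files:
                files.move_to_end(name)  # mark as most recently used
            else:
                if len(files) >= max_open:
                    _, oldest = files.popitem(last=False)
                    oldest.close()
                # "a": a file that was evicted and reopened continues
                # where it left off instead of being truncated
                files[name] = open(name, "a")
            print(d["content"], file=files[name])
    finally:
        for f in files.values():
            f.close()
```

As with the simple version, a file left over from an earlier run is appended to rather than replaced, so the same caveats apply.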



The following jq-based solution ensures that the output in the files is pretty-printed, while ignoring any input object whose .content equals the JSON string "IGNORE ME" (the same string also serves as the record separator the awk script keys on):

jq 'if .content == "IGNORE ME" 
    then "Skipping IGNORE ME" | stderr | empty
    else .name, .content, "IGNORE ME" end' input.json |
    awk '/^"IGNORE ME"$/ {close(fn); fn=""; next}
         fn {print >> fn; next}
         {fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn);}'

