jq: How can I pipe objects from array to different files based on data in object?

Question

I have a large array of objects stored in a master JSON file. I want to loop through that array, take each object, and append it to a new file based on a field in the object (in this case, the state name). In other words, in a set of data containing many states, I want to filter it out to a file for each state.

I'm using an existing JQ expression to filter for only the data I actually need:

{ fipscode: .fipscode, level: .level, polid: .polid, polnum: .polnum, precinctsreporting: .precinctsreporting, precinctsreportingpct: .precinctsreportingpct, precinctstotal: .precinctstotal, raceid: .raceid, runoff: .runoff, statepostal: .statepostal, votecount: .votecount, votepct: .votepct, winner: .winner }

Here's a sample of my input:

[
    { "ballotorder": 2, "candidateid": "9718", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Doug", "id": "3015-polid-64364-state-AZ-1", "incumbent": true, "initialization_data": false, "is_ballot_measure": false, "last": "Ducey", "lastupdated": "2018-08-30T00:01:38.897Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-AZ-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Arizona", "statepostal": "AZ", "test": false, "uncontested": false, "votecount": 355455, "votepct": 0.705493, "winner": true },
    { "ballotorder": 2, "candidateid": "21689", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Ron", "id": "10046-polid-62557-state-FL-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "DeSantis", "lastupdated": "2018-08-29T19:29:50.367Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "62557", "polnum": "13918", "precinctsreporting": 5968, "precinctsreportingpct": 1.0, "precinctstotal": 5968, "raceid": "10046", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-FL-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Florida", "statepostal": "FL", "test": false, "uncontested": false, "votecount": 913997, "votepct": 0.564728, "winner": true },
    { "ballotorder": 2, "candidateid": "45555", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Rex", "id": "38538-polid-67011-state-OK-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "Lawhorn", "lastupdated": "2018-08-29T02:44:44.610Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "Lib", "polid": "67011", "polnum": "40784", "precinctsreporting": 1951, "precinctsreportingpct": 1.0, "precinctstotal": 1951, "raceid": "38538", "racetype": "Runoff", "racetypeid": "L", "reportingunitid": "state-OK-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Oklahoma", "statepostal": "OK", "test": false, "uncontested": false, "votecount": 379, "votepct": 0.409287, "winner": false }
]

As output, I would expect to have an Arizona.json containing only the item(s) from that state, and also filtered to remove unwanted fields:

[
  { "fipscode": null, "level": "state", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "runoff": false, "statepostal": "AZ", "votecount": 355455, "votepct": 0.705493, "winner": true }
]

...and likewise for the other states involved (Florida.json and Oklahoma.json).

Here's the bash and jq script I have so far:

cat master.json |
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
jq -c '.statename as $state | {
    fipscode: .fipscode,
    level: .level,
    polid: .polid,
    polnum: .polnum,
    precinctsreporting: .precinctsreporting,
    precinctsreportingpct: .precinctsreportingpct,
    precinctstotal: .precinctstotal,
    raceid: .raceid,
    runoff: .runoff,
    statepostal: .statepostal,
    votecount: .votecount,
    votepct: .votepct,
    winner: .winner
}'

What I can't figure out is how to intercept each row so I can determine where the output should go. Is this possible?

When you say "append each row to a file", what do you mean by that? Are you wanting to edit an existing file in-place? Generate an output stream? Something else?
– Charles Duffy, Commented Sep 18, 2018 at 16:38
Making this a proper minimal reproducible example, with usable inputs (not fake data that isn't real JSON, but inputs someone can actually test a proposed answer with) and sample output data that would be generated by those inputs would be a big step towards having an answerable question.
– Charles Duffy, Commented Sep 18, 2018 at 16:38
(If you wanted, say, a separate output file per state, that's a different question, and an interesting one -- requires tools that aren't there in pure/baseline jq, but it's doable even so... but please edit the question so we can be completely certain as to exactly what it is you're asking for).
– Charles Duffy, Commented Sep 18, 2018 at 16:40
I want a separate output file per state, yes. I'll edit the question. Thanks for the suggestions.
– Tyler, Commented Sep 18, 2018 at 16:47
Do see the "minimal" part of MCVE guidelines -- ideally, we want a sample simplified as much as possible while still being complete enough to let answers be tested without requiring changes.
– Charles Duffy, Commented Sep 18, 2018 at 17:00

Charles Duffy · Accepted Answer · 2018-09-18 20:15:16Z

You can do this with one copy of jq splitting out data items from the input file, and then another instance per state collating those data items together, with bash providing the glue. See the following example, for bash 4.2 or newer (might work with 4.1, I'd need to check).

#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*|4.[01].*) echo "ERROR: Bash 4.2 required" >&2; exit 1;; esac

input_file=$1
[[ -s $input_file ]] || { echo "Usage: ${0##*/} input-file" >&2; exit 1; }

jq_split_script='
# modify this function to fit your needs
def relevantContentOnly:
  { fipscode, level, polid, polnum, precinctsreporting, precinctsreportingpct, precinctstotal, raceid, runoff, statepostal, votecount, votepct, winner };

.[] | [.statename, (relevantContentOnly | tojson)] | @tsv
'

# Use an associative array to map from state names to output FDs
declare -A out_fds=( )

# Read state / line-of-data pairs from our JQ script...
while IFS=$'\t' read -r state data; do
  # If we don't already have a writer for the current state, start one.
  if [[ ! ${out_fds[$state]} ]]; then
    exec {new_fd}> >(jq -n '[inputs]' >"$state.json")
    out_fds[$state]=$new_fd
  fi
  # Regardless, send the data to the FD we have for this state
  printf '%s\n' "$data" >&${out_fds[$state]}
done < <(jq -rc "$jq_split_script" <"$input_file") # ...running the JQ script above.

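# close output FDs, so the JQ instances all flush
# (iterate over the array's values, which are the FD numbers; "${!out_fds[@]}" would give the state names)
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done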
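
For anyone who wants to see the file-descriptor bookkeeping in isolation, here is a minimal sketch of the same pattern with jq taken out of the picture. It assumes bash 4.2 or newer, writes straight to plain text files instead of piping into a jq collector, and the keys and values (AZ, FL, one, two, three) are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: one output FD per key, opened lazily and then reused.
# Keys, values, and file names are hypothetical sample data.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

declare -A out_fds=( )

while IFS=$'\t' read -r key value; do
  # First time we see this key: open its file and remember the FD number.
  if [[ ! ${out_fds[$key]} ]]; then
    exec {new_fd}>"$key.txt"
    out_fds[$key]=$new_fd
  fi
  # Every line for this key goes to the FD we already have open.
  printf '%s\n' "$value" >&"${out_fds[$key]}"
done < <(printf '%s\t%s\n' AZ one FL two AZ three)

# Close by FD number, i.e. loop over the array's values.
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done

cat AZ.txt
cat FL.txt
```

Each output file is opened once, the first time its key shows up, and every later line for that key reuses the stored descriptor.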

peak · Accepted Answer · 2018-09-19 01:26:40Z

Here's a simple solution piggybacking on what you started with:

< master.json jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
  jq -cr '.statename, {
    fipscode,
    level,
    polid,
    polnum,
    precinctsreporting,
    precinctsreportingpct,
    precinctstotal,
    raceid,
    runoff,
    statepostal,
    votecount,
    votepct,
    winner
}' | while read -r statename && read -r object
do
  echo "$object" >> "$statename.json"
done

Note that this will append the objects to any existing "$statename.json" files.

With your [original] sample data, the above produces Arizona.json, Florida.json, and Oklahoma.json.
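
To make the append behaviour concrete, here is a small sketch that runs the same kind of loop twice. The jq stage is replaced by a hypothetical emit function that prints a name line followed by an object line, and the output goes to a scratch directory (outdir, emit, and split_by_state are assumed names, not part of the pipeline above).

```shell
# Stand-in for the jq stage: a state name line, then a one-line object.
emit() {
  printf '%s\n' Arizona '{"statepostal":"AZ"}' Florida '{"statepostal":"FL"}'
}

# The same read/append loop as above, writing into a scratch directory.
split_by_state() {
  while read -r statename && read -r object; do
    printf '%s\n' "$object" >> "$outdir/$statename.json"
  done
}

outdir=$(mktemp -d)
emit | split_by_state
emit | split_by_state            # a second run appends to the same files
wc -l < "$outdir/Arizona.json"   # counts one line per run
```

Writing each run into a fresh directory, or removing the old per-state files first, avoids mixing results from different runs.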

Tweak

If the overhead of re-opening the output file for every echo is an issue, you could replace the while loop with awk:

awk '
  fn!="" {print > fn; fn=""; next}
  {fn=$0 ".json";
   if (fns[fn]!=1){fns[fn]=1; print fn > "filenames.txt"}}'
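
As a rough sketch of how that awk program slots in, the block below feeds it hand-written lines in the same shape the jq stage emits (a name line, then an object line). The sample lines and the scratch directory are assumptions for illustration; in the real pipeline the awk command would simply take the place of the while loop.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical stand-in for the jq output: name, object, name, object, ...
printf '%s\n' \
  Arizona '{"statepostal":"AZ","winner":true}' \
  Florida '{"statepostal":"FL","winner":true}' \
  Arizona '{"statepostal":"AZ","winner":false}' |
awk '
  # Object line: write it to the file chosen by the preceding name line.
  fn!="" {print > fn; fn=""; next}
  # Name line: remember the target file and record it once in filenames.txt.
  {fn=$0 ".json";
   if (fns[fn]!=1){fns[fn]=1; print fn > "filenames.txt"}}'

cat filenames.txt
cat Arizona.json
```

Unlike the >> in the shell loop, awk's > truncates each file the first time it is opened during a run and then keeps writing to it, so leftover files from an earlier run are overwritten rather than extended.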

Finale

Since you want these files to contain arrays of objects, you could then use jq -s to achieve the final results. I'd probably collect the filenames within the while loop (naively, e.g. echo "$statename.json" >> filenames.txt), and then use sponge:

sort -u filenames.txt | 
  while read -r fn ; do 
    jq -s . "$fn" | sponge "$fn"
  done
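
sponge comes from the moreutils package and may not be installed everywhere. Here is a sketch of the same step without it, using a temporary file and mv. The sample Arizona.json contents and the .tmp suffix are assumptions for illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical leftovers from the splitting step: one object per line.
printf '%s\n' '{"statepostal":"AZ","winner":true}' \
              '{"statepostal":"AZ","winner":false}' > Arizona.json
printf '%s\n' Arizona.json Arizona.json > filenames.txt

sort -u filenames.txt |
  while read -r fn; do
    # Slurp the objects into one array, then swap the result into place.
    jq -s . "$fn" > "$fn.tmp" && mv "$fn.tmp" "$fn"
  done

jq -c . Arizona.json
```

The mv only runs if jq succeeds, so a parse error leaves the original file untouched.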

edited Sep 19, 2018 at 1:26

answered Sep 18, 2018 at 17:42

3 Comments

Charles Duffy Over a year ago

I understand trying to avoid bash 4.x-isms, but re-opening the output file for every single line is pretty messy. Might use awk to replace the while loop at the end -- GNU awk, at least, will automatically maintain a cache of pre-opened FDs for different output files and reuse them appropriately rather than doing the repeated open/write/close thing as this bash code does.

peak Over a year ago

@CharlesDuffy - Sure, I was thinking of using awk at the end, but thought I'd focus on the main hurdle the OP seemed to be facing, thinking that as a percentage of total CPU time, the difference would probably be fairly small. Do you have any numbers?

Charles Duffy Over a year ago

I'd be wanting to look at wall-clock time, not CPU time; a lot is going to be I/O wait and syscall overhead, depending on filesystem details. Let's see -- the US has ~500,000 elected offices if we count state and local. Using best-of-3 timing, I get 0m59.324s wall-clock for time for ((i=0; i<500000; i++)); do echo >>test; done, and 9.885s for time for ((i=0; i<500000; i++)); do echo test; done >>test.
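
A scaled-down sketch of that comparison, for anyone who wants to try it locally. The iteration count n and the file names are arbitrary choices, and the timings depend heavily on the filesystem, so no numbers are claimed here.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d) && cd "$workdir" || exit 1
n=20000   # hypothetical count, much smaller than the 500,000 quoted above

# Re-opens the output file on every iteration.
time for ((i=0; i<n; i++)); do echo test >>per_line; done

# Opens the output file once for the whole loop.
time for ((i=0; i<n; i++)); do echo test; done >>once
```

Both loops write identical content; only the number of open/close calls differs.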
