I have a large array of objects stored in a master JSON file. I want to loop through that array, take each object, and append it to a new file based on a field in the object (in this case, the state name). In other words, given a data set covering many states, I want to split it into one file per state.

I'm using an existing JQ expression to filter for only the data I actually need:

{ fipscode: .fipscode, level: .level, polid: .polid, polnum: .polnum, precinctsreporting: .precinctsreporting, precinctsreportingpct: .precinctsreportingpct, precinctstotal: .precinctstotal, raceid: .raceid, runoff: .runoff, statepostal: .statepostal, votecount: .votecount, votepct: .votepct, winner: .winner }

Here's a sample of my input:

[
    { "ballotorder": 2, "candidateid": "9718", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Doug", "id": "3015-polid-64364-state-AZ-1", "incumbent": true, "initialization_data": false, "is_ballot_measure": false, "last": "Ducey", "lastupdated": "2018-08-30T00:01:38.897Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-AZ-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Arizona", "statepostal": "AZ", "test": false, "uncontested": false, "votecount": 355455, "votepct": 0.705493, "winner": true },
    { "ballotorder": 2, "candidateid": "21689", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Ron", "id": "10046-polid-62557-state-FL-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "DeSantis", "lastupdated": "2018-08-29T19:29:50.367Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "GOP", "polid": "62557", "polnum": "13918", "precinctsreporting": 5968, "precinctsreportingpct": 1.0, "precinctstotal": 5968, "raceid": "10046", "racetype": "Primary", "racetypeid": "R", "reportingunitid": "state-FL-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Florida", "statepostal": "FL", "test": false, "uncontested": false, "votecount": 913997, "votepct": 0.564728, "winner": true },
    { "ballotorder": 2, "candidateid": "45555", "delegatecount": 0, "description": null, "electiondate": "2018-08-28", "electtotal": 0, "electwon": 0, "fipscode": null, "first": "Rex", "id": "38538-polid-67011-state-OK-1", "incumbent": false, "initialization_data": false, "is_ballot_measure": false, "last": "Lawhorn", "lastupdated": "2018-08-29T02:44:44.610Z", "level": "state", "national": true, "officeid": "G", "officename": "Governor", "party": "Lib", "polid": "67011", "polnum": "40784", "precinctsreporting": 1951, "precinctsreportingpct": 1.0, "precinctstotal": 1951, "raceid": "38538", "racetype": "Runoff", "racetypeid": "L", "reportingunitid": "state-OK-1", "reportingunitname": null, "runoff": false, "seatname": null, "seatnum": null, "statename": "Oklahoma", "statepostal": "OK", "test": false, "uncontested": false, "votecount": 379, "votepct": 0.409287, "winner": false }
]

As output, I would expect to have an Arizona.json containing only the item(s) from that state, and also filtered to remove unwanted fields:

[
  { "fipscode": null, "level": "state", "polid": "64364", "polnum": "5554", "precinctsreporting": 1488, "precinctsreportingpct": 0.9993000000000001, "precinctstotal": 1489, "raceid": "3015", "runoff": false, "statepostal": "AZ", "votecount": 355455, "votepct": 0.705493, "winner": true }
]

...and likewise for the other states involved (Florida.json and Oklahoma.json).


Here's the bash and jq script I have so far:

cat master.json |
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
jq -c '.statename as $state | {
    fipscode: .fipscode,
    level: .level,
    polid: .polid,
    polnum: .polnum,
    precinctsreporting: .precinctsreporting,
    precinctsreportingpct: .precinctsreportingpct,
    precinctstotal: .precinctstotal,
    raceid: .raceid,
    runoff: .runoff,
    statepostal: .statepostal,
    votecount: .votecount,
    votepct: .votepct,
    winner: .winner
}'

What I can't figure out is how to intercept each row so I can determine where the output should go. Is this possible?

  • When you say "append each row to a file", what do you mean by that? Are you wanting to edit an existing file in-place? Generate an output stream? Something else? Commented Sep 18, 2018 at 16:38
  • Making this a proper minimal reproducible example, with usable inputs (not fake data that isn't real JSON, but inputs someone can actually test a proposed answer with) and sample output data that would be generated by those inputs would be a big step towards having an answerable question. Commented Sep 18, 2018 at 16:38
  • (If you wanted, say, a separate output file per state, that's a different question, and an interesting one -- requires tools that aren't there in pure/baseline jq, but it's doable even so... but please edit the question so we can be completely certain as to exactly what it is you're asking for). Commented Sep 18, 2018 at 16:40
  • I want a separate output file per state, yes. I'll edit the question. Thanks for the suggestions. Commented Sep 18, 2018 at 16:47
  • Do see the "minimal" part of MCVE guidelines -- ideally, we want a sample simplified as much as possible while still being complete enough to let answers be tested without requiring changes. Commented Sep 18, 2018 at 17:00

2 Answers

You can do this with one copy of jq splitting out data items from the input file, and then another instance per state collating those data items together, with bash providing the glue. See the following example, for bash 4.2 or newer (might work with 4.1, I'd need to check).

#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*|4.[01].*) echo "ERROR: Bash 4.2 required" >&2; exit 1;; esac

input_file=$1
[[ -s $input_file ]] || { echo "Usage: ${0##*/} input-file" >&2; exit 1; }

jq_split_script='
# modify this function to fit your needs
def relevantContentOnly:
  { fipscode, level, polid, polnum, precinctsreporting, precinctsreportingpct, precinctstotal, raceid, runoff, statepostal, votecount, votepct, winner };

.[] | [.statename, (relevantContentOnly | tojson)] | @tsv
'

# Use an associative array to map from state names to output FDs
declare -A out_fds=( )

# Read state / line-of-data pairs from our JQ script...
while IFS=$'\t' read -r state data; do
  # If we don't already have a writer for the current state, start one.
  if [[ ! ${out_fds[$state]} ]]; then
    exec {new_fd}> >(jq -n '[inputs]' >"$state.json")
    out_fds[$state]=$new_fd
  fi
  # Regardless, send the data to the FD we have for this state
  printf '%s\n' "$data" >&${out_fds[$state]}
done < <(jq -rc "$jq_split_script" <"$input_file") # ...running the JQ script above.

# close output FDs, so the JQ instances all flush
# (iterate over the array's values, which are the FD numbers, not its keys)
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done
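
To try the "one open FD per key" pattern in isolation, here is a minimal sketch. It assumes bash 4.1 or newer and writes to plain files instead of jq writers; the demo_*.txt file names and the tab-separated sample lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Demo of keeping one open file descriptor per key (bash 4.1+).
# The demo_*.txt names and the sample input are hypothetical.
declare -A out_fds=()

while IFS=$'\t' read -r key value; do
  # Open a new output file the first time a key is seen.
  if [[ -z ${out_fds[$key]:-} ]]; then
    exec {new_fd}>"demo_$key.txt"
    out_fds[$key]=$new_fd
  fi
  # Write through the already-open descriptor for this key.
  printf '%s\n' "$value" >&"${out_fds[$key]}"
done < <(printf 'AZ\tfirst\nFL\tsecond\nAZ\tthird\n')

# Close every descriptor; the array's values are the FD numbers.
for fd in "${out_fds[@]}"; do
  exec {fd}>&-
done

cat demo_AZ.txt
```

Each file is opened once, however many lines are routed to it, which is the point of the pattern.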

Here's a simple solution piggybacking on what you started with:

< master.json jq -cn --stream 'fromstream(1|truncate_stream(inputs))' |
  jq -cr '.statename, {
    fipscode,
    level,
    polid,
    polnum,
    precinctsreporting,
    precinctsreportingpct,
    precinctstotal,
    raceid,
    runoff,
    statepostal,
    votecount,
    votepct,
    winner
}' | while read -r statename && read -r object
do
  echo "$object" >> "$statename.json"
done

Note that this will append the objects to any existing "$statename.json" files.

With your [original] sample data, the above produces Arizona.json, Florida.json, and Oklahoma.json.

Tweak

If the overhead of re-opening the output file for every object is an issue, you could replace the while loop with awk:

awk '
  fn!="" {print > fn; fn=""; next}
  {fn=$0 ".json";
   if (fns[fn]!=1){fns[fn]=1; print fn > "filenames.txt"}}'
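
As a rough illustration of how this awk program consumes the alternating name/object lines, the sketch below feeds it simulated input with printf instead of the jq stage. The {"n":...} objects are placeholders, not real data.

```shell
# Simulated output of the jq stage: a state name line, then an object line.
# The objects are made-up placeholders.
printf '%s\n' \
  'Arizona' '{"n":1}' \
  'Florida' '{"n":2}' \
  'Arizona' '{"n":3}' |
awk '
  fn!="" {print > fn; fn=""; next}
  {fn=$0 ".json";
   if (fns[fn]!=1){fns[fn]=1; print fn > "filenames.txt"}}'

cat Arizona.json     # two lines: {"n":1} and {"n":3}
cat filenames.txt    # Arizona.json, then Florida.json
```

Unlike the >> in the while loop, awk's > truncates each file the first time it is opened during the run, so leftover content from an earlier run is overwritten rather than appended to.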

Finale

Since you want these files to contain arrays of objects, you could then use jq -s to achieve the final results. I'd probably collect the filenames within the while loop (naively, e.g. echo "$statename.json" >> filenames.txt), and then use sponge:

sort -u filenames.txt |
  while read -r fn ; do
    jq -s . "$fn" | sponge "$fn"
  done
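
sponge comes from the moreutils package and may not be installed everywhere. If it isn't available, one possible substitute is a temporary file that is renamed over the original once jq succeeds. This is a sketch only: Demo.json and its contents are made-up stand-ins for a per-state file, and the "$fn.tmp" naming is an assumption.

```shell
# Made-up stand-in for a per-state file holding one JSON object per line.
printf '%s\n' '{"n":1}' '{"n":2}' > Demo.json
printf '%s\n' 'Demo.json' > filenames.txt

sort -u filenames.txt |
  while IFS= read -r fn; do
    tmp="$fn.tmp"    # hypothetical temp name, in the same directory as the target
    # Replace the original only if jq parsed the whole file successfully.
    if jq -c -s . "$fn" > "$tmp"; then
      mv "$tmp" "$fn"
    else
      rm -f "$tmp"
    fi
  done

cat Demo.json
```

Writing to a separate file first matters because redirecting jq's output straight onto "$fn" would truncate the file before jq had read it.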

3 Comments

I understand trying to avoid bash 4.x-isms, but re-opening the output file for every single line is pretty messy. Might use awk to replace the while loop at the end -- GNU awk, at least, will automatically maintain a cache of pre-opened FDs for different output files and reuse them appropriately rather than doing the repeated open/write/close thing as this bash code does.
@CharlesDuffy - Sure, I was thinking of using awk at the end, but thought I'd focus on the main hurdle the OP seemed to be facing, thinking that as a percentage of total CPU time, the difference would probably be fairly small. Do you have any numbers?
I'd be wanting to look at wall-clock time, not CPU time; a lot is going to be I/O wait and syscall overhead, depending on filesystem details. Let's see -- the US has ~500,000 elected offices if we count state and local. Using best-of-3 timing, I get 0m59.324s wall-clock for time for ((i=0; i<500000; i++)); do echo >>test; done, and 9.885s for time for ((i=0; i<500000; i++)); do echo test; done >>test.
