Using awk on multiple input files

Question

I'm working on a bash script in which I need to process two CSV files at once with awk in order to produce several output files. In short, there is a main file holding the records to be dispatched to the output files, and a second file that gives each output file's name and the number of records it should hold. The first n records go to the first output file, the next k records (n+1 to n+k) go to the second, and so on.

To make this clearer, here's an example of how the main record file might look:

x11,x21
x12,x22
x13,x23
x14,x24
x15,x25
x16,x26
x17,x27
x18,x28
x19,x29

and how the other file might look:

out_file_name_1,2
out_file_name_2,3
out_file_name_3,4

Then the first output file, named out_file_name_1, should look like:

x11,x21
x12,x22

The second output file, named out_file_name_2, should look like:

x13,x23
x14,x24
x15,x25

And the last one should look like:

x16,x26
x17,x27
x18,x28
x19,x29

Hopefully it is clear enough.

The description is quite vague. To get useful answers, you will probably need to spell everything out clearly. For example: "there's a main file which keeps the record of a content to be dispatched to some other output files whose names and number of records will be derived from another file." In what format is the "record of content" kept? Precisely how should it be "dispatched"? How will those names and numbers "be derived from another file"? For best results, show a small sample of all the required input files and the resulting output files.
– John1024, Commented Mar 13, 2015 at 0:50

tripleee · Accepted Answer · 2015-03-13 09:35:16Z

1

I wouldn't use Awk for this.

while IFS=, read -u 3 filename lines; do
    head -n "$lines" >"$filename"
done 3<other.csv <main.csv

Using read -u to read from a particular file descriptor is not specified by POSIX, so it is not completely portable, but your question is tagged bash, so I am assuming that is not a problem here.

Demo: http://ideone.com/6FisHT
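Whether the head version works depends on each head call leaving the shared input positioned right after the lines it printed. A quick way to check this on a given system is sketched below; the scratch directory and the five-line sample file are assumptions made only for this test.

```shell
# Scratch directory with a five-line sample of main.csv (illustrative data).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' x11,x21 x12,x22 x13,x23 x14,x24 x15,x25 >main.csv

# Two head calls share one input. If 4 lines come out, the second head picked
# up where the first stopped; fewer means the first head read ahead.
count=$( (head -n 2; head -n 2) <main.csv | wc -l | tr -d ' ' )
echo "lines printed: $count"
```

If the count is lower than 4, the head-based loop is likely to produce empty output files on that system.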

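If you end up with empty files after the first (some head implementations read ahead and consume more input than they print), try replacing head with a loop of read statements.

while IFS=, read -r -u 3 filename lines; do
    for ((i = 0; i < lines; i++)); do
        # Stop early if main.csv runs out of records.
        IFS= read -r line || break
        printf '%s\n' "$line"
    done >"$filename"
done 3<other.csv <main.csv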
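For shells without read -u, the same idea can be written with a redirection on the read itself. This is only a sketch under assumptions: it runs in a scratch directory and recreates the question's sample data so that it can be tried on its own.

```shell
#!/bin/sh
# Work in a scratch directory so the sample files do not clobber anything.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Sample inputs copied from the question.
printf '%s\n' x11,x21 x12,x22 x13,x23 x14,x24 x15,x25 \
              x16,x26 x17,x27 x18,x28 x19,x29 >main.csv
printf '%s\n' out_file_name_1,2 out_file_name_2,3 out_file_name_3,4 >other.csv

# "read ... <&3" takes the control line from descriptor 3, as read -u 3 does.
while IFS=, read -r filename lines <&3; do
    i=0
    # Check the counter first so no extra record is consumed from main.csv.
    while [ "$i" -lt "$lines" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done >"$filename"
done 3<other.csv <main.csv
```

The inner loop reads from standard input (main.csv) while the outer read uses descriptor 3 (other.csv), so the two inputs never interfere with each other.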

edited Mar 13, 2015 at 9:35

answered Mar 13, 2015 at 7:52

tripleee


7 Comments

jas Over a year ago

This seems like a great approach, but when I run it on OP's data the second two output files are empty. Is it different for you?

tripleee Over a year ago

Yeah, I tested it here before posting, and again now just to confirm; Bash 4.1.5(1)-release (x86_64-pc-linux-gnu), Debian Squeeze.

jas Over a year ago

Cool, I didn't imagine you'd post without verifying first. I'm on Mac OS X, bash 3.2.57. I think it boils down to (head -n 2; head -n 2) < main.csv only outputting two lines for me.

tripleee Over a year ago

You can work around that with read but it's kind of clunky. I'll update with a suggestion.

tripleee Over a year ago

Sure, it could be done. gnu.org/software/gawk/manual/html_node/Split-Program.html


jas · Accepted Answer · 2015-03-13 10:25:18Z

1

Here's a solution in awk since you asked, but clearly tripleee's answer is the nicer approach.

$ cat oak.awk
BEGIN { FS = ","; fidx = 1 }

# Processing files.txt, init parallel arrays with filename and number of records
# to print to each one.
NR == FNR {
    file[NR] = $1
    records[NR] = $2
    next
}

# Processing main.txt. Print record to current file. Decrement number of records to print,
# advancing to the next file when number of records to print reaches 0
fidx in file && records[fidx] > 0 {
    print > file[fidx]
    if (! --records[fidx]) ++fidx
    next
}

# If we get here, either we ran out of files before reading all the records
# or a file was specified to contain zero records    
{ print "Error: Insufficient number of files or file with non-positive number of records"
  exit 1 }


$ cat files.txt
out_file_name_1,2
out_file_name_2,3
out_file_name_3,4

$ cat main.txt
x11,x21
x12,x22
x13,x23
x14,x24
x15,x25
x16,x26
x17,x27
x18,x28
x19,x29

$ awk -f oak.awk files.txt main.txt

$ cat out_file_name_1
x11,x21
x12,x22

$ cat out_file_name_2
x13,x23
x14,x24
x15,x25

$ cat out_file_name_3
x16,x26
x17,x27
x18,x28
x19,x29

edited Mar 13, 2015 at 10:25

answered Mar 13, 2015 at 2:19

jas


2 Comments

oakenshield1 Over a year ago

Yes, thank you. Actually this was the answer I was looking for, but since @tripleee answered it in an elegant way, I agree with you and will go with his answer.

tripleee Over a year ago

You aren't closing open file handles, so you will run out when you have more than just a handful of files. Some Awk implementations are really constrained in this regard. It was the one problem I wanted to avoid by moving to shell script; but all things counted, it should not be a very major addition to this script (just close the old file when moving to the next one).
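A minimal sketch of that change, assuming the same files.txt and main.txt contents as in the answer (recreated here in a scratch directory so the example runs on its own). The only difference from the script above is the close() call made when a file's quota is used up.

```shell
# Scratch directory with the sample inputs from the answer.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' out_file_name_1,2 out_file_name_2,3 out_file_name_3,4 >files.txt
printf '%s\n' x11,x21 x12,x22 x13,x23 x14,x24 x15,x25 \
              x16,x26 x17,x27 x18,x28 x19,x29 >main.txt

awk '
BEGIN { FS = ","; fidx = 1 }

# First input (files.txt): remember each output name and its record count.
NR == FNR { file[NR] = $1; records[NR] = $2; next }

# Second input (main.txt): write the record to the current output file.
fidx in file && records[fidx] > 0 {
    print > file[fidx]
    if (--records[fidx] == 0) {
        close(file[fidx])   # release the handle before moving on
        ++fidx
    }
    next
}

# Ran out of output files, or a file was given a non-positive count.
{ print "Error: Insufficient number of files or file with non-positive number of records"
  exit 1 }
' files.txt main.txt
```

With this change only one output file is open at any time, however many names files.txt lists.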
