For this task, you need to store the value of $4 for each key ($1) of file1 (in the script below, I'll use an array called keys for this, with $1 as the key and $4 as the value).

You also need to store each actual line in another array (I'll use lines for this, with the line-number as the key and the entire line as the value). Note that this can consume large amounts of memory if file1 is huge... but, unless it's enormous, that's probably not a problem on any modern system with many gigabytes of RAM. If it does happen to be too large to fit into RAM, the script would have to be modified to iterate through the first file a second time rather than store it in the lines array (there's a sketch of that variant at the end of this answer).

Finally, you also need to store the key ($1) that corresponds to each line-number (I'll use an array called linekeys for this, with the line-number as the index and the key, $1, as the value). BTW, if the first file were so huge that you had to process it a second time, this array wouldn't be needed, as you could just get the key from $1 as you process each line again. Technically, this array isn't really needed at all, as you could split() it from lines[l] in the END{} block when you need it, but it's easier to do it this way - trading a bit more memory usage for simpler code and possibly a faster run-time.
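
To make that aside concrete, here's a sketch of what such an END{} block could look like (the main script below keeps the linekeys array instead; f is just a scratch array for split() to fill):

    # alternative END block: recover each line's key with split()
    # instead of storing it in a linekeys array
    END {
      for (l in lines) {
        split(lines[l], f)          # re-split the saved line on whitespace
        print lines[l], sum[f[1]]   # f[1] is the key, i.e. $1 of that line
      }
    }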

awk '# process the first file
     NR==FNR {
       keys[$1] = $4;      # remember the value of $4 for the key ($1)
       lines[FNR] = $0;    # store the entire line
       linekeys[FNR] = $1; # remember the key for that line
       next
     };

     # process any remaining file(s)
     $1 in keys {
       if ($2 < keys[$1]) {
         sum[$1]+=$3
       };
     };

     # All files have been processed, so print the output
     END {
       for (l in lines) {
         print lines[l], sum[linekeys[l]]
       }
     }' file1 file2
NC_000001.11_NM_001005484.2 69270   234 69037 9
NC_000001.11_NM_001005484.2 69511   475 69037 9
NC_000001.11_NM_001005484.2 69761   725 69037 9
NC_000001.11_NM_001385640.1 942155  20  942136 1361

BTW, I'd recommend saving this either as a sh script as-is (except using "$@" as the argument to awk instead of file1 file2, so you can specify the input files on the command-line when you run it, e.g. as bash scriptname.sh file1 file2), OR as an awk script (remove the awk command, the single-quotes, and the filenames) so you can run it as awk -f scriptname.awk file1 file2. With an appropriate #! line as the first line of the script (e.g. #!/usr/bin/awk -f for the awk version, adjusting the path to wherever awk lives on your system), you can also make it executable so you can run it directly without having to type the interpreter name on the command-line.

Or, if you really insist, you could squeeze the entire script onto one line - semi-colons have been left in place where needed between statements to allow for that. I wouldn't recommend it, though, as the shell command line is a terrible place to be editing scripts, even ones as short as this, and even with convenience features (like Ctrl-X Ctrl-E in bash) to edit the current line in vi or your preferred editor.
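
And finally, as mentioned near the top: if file1 really were too big to hold in memory, a two-pass variant along these lines could be used instead. This is just an untested sketch: it reads file1 again after file2, tells the passes apart by counting input files, and needs neither the lines nor the linekeys array:

    awk '# count input files so we know which pass we are on
         FNR==1 { filenum++ };

         # pass 1: file1 - remember the value of $4 for each key ($1)
         filenum==1 { keys[$1] = $4; next };

         # file2 - accumulate the sums
         filenum==2 { if ($1 in keys && $2 < keys[$1]) sum[$1] += $3; next };

         # pass 2: file1 again - print each line with its sum appended
         filenum==3 { print $0, sum[$1] }' file1 file2 file1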