
I'm using Bash, and I have a directory of .tsv files containing behavioral data (RT and accuracy) for multiple subjects, with multiple sessions per subject. My goal is to concatenate the RT field (field 3 of each .tsv file) and the accuracy field (field 9) across all these files into a single .tsv file. Each time I append a file, I also want to add the subject and session (taken from the directory names) as new variables, so that every RT and accuracy row stays paired with its subject and session.

To illustrate, each .tsv file has the following header as its first row:

V1 V2 RT V4 V5 V6 V7 V8 ACC

I want to loop through many of these files, extract just the RT and ACC fields, and append them to a new file called "summary.tsv", with SUB and SES added as new variables:

SUB SES RT ACC

Here's the code I have so far:

subdir=~/path/to/subdir

for subs in ${subdir}/subject-*; do
          sub=$(basename ${subs})
          for sess in ${subs}/session-*; do
                    ses=$(basename ${sess})
                    for files in ${sess}/*.tsv; do
                              if [[ -e $files ]] && [[ -e ${outdir}/summary.tsv ]] ; then
                                        awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv
                              fi
                              if [[ -e $files ]] && [[ ! -e ${outdir}/summary.tsv ]] ; then
                                        awk '{print $3,$9}' ${files} > ${outdir}/summary.tsv
                              fi
                    done
          done
done

This works fine to concatenate files into the summary.tsv file without repeating each file's header, but what I can't figure out is how to add 2 new variables with the same length as the appended output in the "awk 'NR > 1 {print $3,$9}' ${files} >> ${outdir}/summary.tsv" line, containing the corresponding ${sub} and ${ses} variables in the 1st and 2nd fields.

Any suggestions? Thank you so much in advance.

  • First of all, special thanks for showing your efforts in your question. Please trim the question down to the actual problematic part; as it stands, it's not fully clear. Kindly make the changes (posting samples as text if needed) and let us know. Commented Sep 3, 2020 at 14:59
  • You seem to have forgotten to write awk in front of the Awk scripts! Commented Sep 3, 2020 at 15:01
  • The innermost if is unnecessary, you can append to a file which doesn't exist and then the shell will simply create it. Commented Sep 3, 2020 at 15:03
  • Thanks for taking a look at this! The innermost if I believe is necessary, because I only want to include the header containing variables if the summary.tsv file does not exist yet. If it already does, I want to omit the header. Commented Sep 3, 2020 at 15:06

1 Answer

Your script has a number of issues, but the answer to your actual question is

awk -v subj="$sub" -v ses="$ses" 'BEGIN { OFS="\t" }
    NR>1 { print subj, ses, $3, $9 }'

Awk can read many files so the innermost loop is unnecessary. Here is a tentative refactoring.

for subs in ~/path/to/subdir/subject-*; do
    sub=$(basename "$subs")
    for sess in "$subs"/session-*; do
         ses=$(basename "$sess")
         awk -v subj="$sub" -v ses="$ses" '
             BEGIN { OFS="\t" }
             FNR>1 { print subj, ses, $3, $9 }' \
                 "$sess"/*.tsv
    done
done >> "$outdir"/summary.tsv
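
The switch from NR to FNR matters once Awk reads several files in one call: NR keeps counting across files, while FNR restarts at 1 for each file. A minimal sketch of the difference, using two hypothetical throwaway files (the names and contents are made up for illustration):

```shell
# Two tiny sample files, each with a header line "h" (illustrative only).
tmp=$(mktemp -d)
printf 'h\na\n' > "$tmp"/one.txt
printf 'h\nb\n' > "$tmp"/two.txt

# NR > 1 skips only the very first line of the whole input,
# so the header of the second file leaks through.
with_nr=$(awk 'NR > 1' "$tmp"/one.txt "$tmp"/two.txt)

# FNR > 1 skips the first line of every file.
with_fnr=$(awk 'FNR > 1' "$tmp"/one.txt "$tmp"/two.txt)

printf 'NR > 1:\n%s\nFNR > 1:\n%s\n' "$with_nr" "$with_fnr"
```

With NR the output still contains the second file's "h" line; with FNR only the data lines "a" and "b" remain.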

I would recommend against having headers in the output file at all, but if you need a header line, writing one before the main script should be easy enough.
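
If a header is wanted, one possible approach is to write it once, before the loops run, and let the loops append below it. A minimal sketch, assuming the SUB, SES, RT, ACC column names from the question; the temporary `outdir` here is a stand-in for your real output directory:

```shell
# Hypothetical output directory for illustration; substitute your own.
outdir=$(mktemp -d)

# Write the header once, truncating any previous summary file.
printf 'SUB\tSES\tRT\tACC\n' > "$outdir"/summary.tsv

# The loops then append their rows beneath it, for example:
#   done >> "$outdir"/summary.tsv
head -n 1 "$outdir"/summary.tsv
```

Because the header is written with `>` and the data with `>>`, rerunning the whole script starts from a fresh file instead of appending duplicate rows.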

If your directory structure is this simple (and you don't have hundreds of thousands of files, so that passing a single wildcard to Awk will not produce a "command line too long" error) you could probably simplify all the loops into a single Awk script. The current file name is in the FILENAME variable; pulling out the bottom two parent directories with a simple regex or split() should be straightforward, too.
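
The single-script idea could look roughly like the sketch below. It builds a small hypothetical tree under a temporary `root` (the file name `task.tsv` and the sample values are invented for the demo), and it assumes the fields are strictly tab-separated, so FS is set to a tab as well as OFS:

```shell
# Build a tiny sample tree: one subject, two sessions (illustrative only).
root=$(mktemp -d)
mkdir -p "$root"/subject-01/session-1 "$root"/subject-01/session-2
printf 'V1\tV2\tRT\tV4\tV5\tV6\tV7\tV8\tACC\n1\t2\t350\t4\t5\t6\t7\t8\t1\n' \
    > "$root"/subject-01/session-1/task.tsv
printf 'V1\tV2\tRT\tV4\tV5\tV6\tV7\tV8\tACC\n1\t2\t420\t4\t5\t6\t7\t8\t0\n' \
    > "$root"/subject-01/session-2/task.tsv

awk 'BEGIN { FS = OFS = "\t" }
     FNR == 1 {
         # First line of a new file: split its path on "/" and
         # take the two parent directories as subject and session.
         n = split(FILENAME, parts, "/")
         subj = parts[n-2]; ses = parts[n-1]
         next    # skip the header row of each file
     }
     { print subj, ses, $3, $9 }' \
    "$root"/subject-*/session-*/*.tsv > "$root"/summary.tsv

cat "$root"/summary.tsv
```

Recomputing the subject and session only when FNR is 1 avoids splitting the path again for every data row.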

3 Comments

This worked perfectly, thank you so much for solving this in a very clean way, and I appreciate the timeliness. A couple of minor adjustments I needed to make (which you couldn't have known about) were to change sub --> subj since sub is a built-in variable for GAWK, and to add the conditional checking whether the file exists, since some subjects were missing these files.
Thanks for the sub remark; I should have noticed that myself. (It's a function, not a variable.) Changed that now.
There's a problem with if [ -e whatever/*.tsv ] because you get a syntax error if the wildcard matches more than one file. Doing a single wildcard over all the files will trivially solve that, too.
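
If a per-session existence check is still needed, one way to test a wildcard safely is to expand it into the positional parameters and test only the first match. This is a sketch, not the only option; `has_tsv` is a hypothetical helper name, and the temporary `sess` directory stands in for a real session directory:

```shell
# Succeed if the given directory contains at least one .tsv file.
has_tsv() {
    set -- "$1"/*.tsv    # positional parameters are local to the function
    [ -e "$1" ]          # an unmatched glob stays literal, so -e fails
}

sess=$(mktemp -d)        # hypothetical session directory, empty at first
if has_tsv "$sess"; then before=yes; else before=no; fi

touch "$sess"/a.tsv "$sess"/b.tsv
if has_tsv "$sess"; then after=yes; else after=no; fi

echo "before: $before, after: $after"
```

Because only "$1" is tested, the check behaves the same whether the wildcard matches one file or many.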
