UPDATE
I suspect that the input and desired output data I initially put in wasn't exactly the same as I what I have with respect to whitespace. I've now put new input data and desired output data.
I currently have a data set that looks like this:
INPUT
Hybridization REF TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-WR-A838-01A-12D-A409-05 TCGA-WR-A838-01A-12D-A409-05 TCGA-WR-A838-01A-12D-A409-05 TCGA-WR-A838-01A-12D-A409-05
Composite Element REF Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate Beta_value Gene_Symbol Chromosome Genomic_Coordinate
cg00000029 0.162232896986279 RBL2 16 53468112 0.191627667901702 RBL2 16 53468112 0.0712181967886229 RBL2 16 53468112 0.0797617926225958 RBL2 16 53468112 0.134907151266991 RBL2 16 53468112 0.0541415985613948 RBL2 16 53468112 0.0898579298345672 RBL2 16 53468112 0.037865566345129 RBL2 16 53468112 0.0681542463965581 RBL2 16 53468112 0.101053013486289 RBL2 16 53468112
cg00000108 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206 NA C3orf35 3 37459206
cg00000109 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037 NA FNDC3B 3 171916037
The data set is much bigger and is almost 10 GB in size. So too big to do in R for example.
However, alot of the columns are effectively duplicates. For example, I only need to keep one each of the columns titled (second row) Gene_Symbol, Chromosome and Genomic_Coordinate. The individual Beta_value columns need to stay because they are different for each sample. Sample IDs are on the first row. So an example desired output of the above is:
DESIRED OUTPUT
Hybridization REF Gene_Symbol Chromosome Genomic_Coordinate TCGA-13-A5FT-01A-11D-A409-05 TCGA-13-A5FU-01A-11D-A409-05 TCGA-29-A5NZ-01A-11D-A409-05 TCGA-3P-A9WA-01A-11D-A409-05 TCGA-59-A5PD-01A-11D-A409-05 TCGA-5X-AA5U-01A-11D-A409-05 TCGA-OY-A56P-01A-12D-A409-05 TCGA-OY-A56Q-01A-11D-A409-05 TCGA-VG-A8LO-01A-11D-A409-05 TCGA-WR-A838-01A-12D-A409-05
cg00000029 RBL2 16 53468112 0.162232897 0.191627668 0.071218197 0.079761793 0.134907151 0.054141599 0.08985793 0.037865566 0.068154246 0.101053013
cg00000108 C3orf35 3 37459206 NA NA NA NA NA NA NA NA NA NA
cg00000109 FNDC3B 3 171916037 NA NA NA NA NA NA NA NA NA NA
Note that I've shuffled the column headers in the first row to remove redundant information. The TCGA... headed columns are the sample identifiers. Note that hard coding the column numbers won't work since I need to iterate through multiple files all of which will have variable numbers of samples/columns.
What's the most efficient way of doing that with bash or with python?
EDIT:
Using John Zwinck's answer below but with \t in between each %s I am now getting the following error:
+ NR == 1 '{' ../../../methods/meth_ma_gene/awk_methreformat.sh: line 3: NR: command not found ../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: syntax error near unexpected token `(' ../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: ` for (i=1; i <= NF; i++) {'
I run the awk script in a bash script as follows:
#!/usr/bin/env bash
for f in TCGAOV*; do
sed 's/Hybridization REF/Hybridization_REF/; s/Composite Element REF/Composite_Element_REF/' "$f" > "final.$f"
bash -x ./../../methods/meth_ma_gene/awk_methreformat.sh "final.$f" > "final.$f"
done
where the referenced awk script is:
#!/usr/bin/awk -f
NR == 1 {
# collect sample names
for (i=1; i <= NF; i++) {
sample[i] = $i
}
}
NR == 2 {
# first four columns are always the same
cols[1] = 1
cols[2] = 3
cols[3] = 4
cols[4] = 5
printf "%s\t%s\t%s\t%s\t", sample[1], $3, $4, $5
# dynamic columns (in practice: 2,6,10,...)
for (i=1; i <= NF; i++) {
if ($i == "Beta_value") {
cols[length(cols)+1] = i
printf "%s\t", sample[i]
}
}
printf "\n"
}
NR >= 3 {
# print cols from data row
for (i=1; i <= length(cols); i++) {
printf "%s\t", $cols[i]
}
printf "\n"
}
I think this may be something to do with the spacing between the fields. Note that I've also amended the input above, since I missed out that there was actually no _ in between the words forming Hybridization REF and also those forming Composite Element REF