1

UPDATE

I suspect that the input and desired output data I initially put in wasn't exactly the same as I what I have with respect to whitespace. I've now put new input data and desired output data.

I currently have a data set that looks like this:

INPUT

Hybridization REF   TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-WR-A838-01A-12D-A409-05    TCGA-WR-A838-01A-12D-A409-05    TCGA-WR-A838-01A-12D-A409-05    TCGA-WR-A838-01A-12D-A409-05
Composite Element REF   Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate  Beta_value  Gene_Symbol Chromosome  Genomic_Coordinate
cg00000029  0.162232896986279   RBL2    16  53468112    0.191627667901702   RBL2    16  53468112    0.0712181967886229  RBL2    16  53468112    0.0797617926225958  RBL2    16  53468112    0.134907151266991   RBL2    16  53468112    0.0541415985613948  RBL2    16  53468112    0.0898579298345672  RBL2    16  53468112    0.037865566345129   RBL2    16  53468112    0.0681542463965581  RBL2    16  53468112    0.101053013486289   RBL2    16  53468112
cg00000108  NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206    NA  C3orf35 3   37459206
cg00000109  NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037   NA  FNDC3B  3   171916037

The data set is much bigger and is almost 10 GB in size. So too big to do in R for example.

However, alot of the columns are effectively duplicates. For example, I only need to keep one each of the columns titled (second row) Gene_Symbol, Chromosome and Genomic_Coordinate. The individual Beta_value columns need to stay because they are different for each sample. Sample IDs are on the first row. So an example desired output of the above is:

DESIRED OUTPUT

Hybridization REF   Gene_Symbol Chromosome  Genomic_Coordinate  TCGA-13-A5FT-01A-11D-A409-05    TCGA-13-A5FU-01A-11D-A409-05    TCGA-29-A5NZ-01A-11D-A409-05    TCGA-3P-A9WA-01A-11D-A409-05    TCGA-59-A5PD-01A-11D-A409-05    TCGA-5X-AA5U-01A-11D-A409-05    TCGA-OY-A56P-01A-12D-A409-05    TCGA-OY-A56Q-01A-11D-A409-05    TCGA-VG-A8LO-01A-11D-A409-05    TCGA-WR-A838-01A-12D-A409-05
cg00000029  RBL2    16  53468112    0.162232897 0.191627668 0.071218197 0.079761793 0.134907151 0.054141599 0.08985793  0.037865566 0.068154246 0.101053013
cg00000108  C3orf35 3   37459206    NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
cg00000109  FNDC3B  3   171916037   NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

Note that I've shuffled the column headers in the first row to remove redundant information. The TCGA... headed columns are the sample identifiers. Note that hard coding the column numbers won't work since I need to iterate through multiple files all of which will have variable numbers of samples/columns.

What's the most efficient way of doing that with bash or with python?

EDIT:

Using John Zwinck's answer below but with \t in between each %s I am now getting the following error:

+ NR == 1 '{' ../../../methods/meth_ma_gene/awk_methreformat.sh: line 3: NR: command not found ../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: syntax error near unexpected token `(' ../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: ` for (i=1; i <= NF; i++) {'

I run the awk script in a bash script as follows:

#!/usr/bin/env bash

for f in TCGAOV*; do  

    sed 's/Hybridization REF/Hybridization_REF/; s/Composite Element REF/Composite_Element_REF/' "$f" > "final.$f"

    bash -x ./../../methods/meth_ma_gene/awk_methreformat.sh "final.$f" > "final.$f"

done

where the referenced awk script is:

#!/usr/bin/awk -f                                                                                                                                                               
NR == 1 {
    # collect sample names                                                                                                                                                               
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}

NR == 2 {
    # first four columns are always the same                                                                                                                                             
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s\t%s\t%s\t%s\t", sample[1], $3, $4, $5

    # dynamic columns (in practice: 2,6,10,...)                                                                                                                                          
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s\t", sample[i]
        }
    }
    printf "\n"
}

NR >= 3 {
    # print cols from data row                                                                                                                                                           
    for (i=1; i <= length(cols); i++) {
        printf "%s\t", $cols[i]
    }
    printf "\n"
}

I think this may be something to do with the spacing between the fields. Note that I've also amended the input above, since I missed out that there was actually no _ in between the words forming Hybridization REF and also those forming Composite Element REF

6
  • Since you included python in the question, have you tried using Pandas? Commented Apr 18, 2016 at 23:34
  • No, should I? I've looked it up just now and it seems to be that pandas offers data analysis functionality rather than data formatting Commented Apr 19, 2016 at 0:09
  • 2
    It can generally be useful for large matrix manipulation operations. Data sanitization is a subset of data analysis Commented Apr 19, 2016 at 0:10
  • If I understood your question, you're trying to get the input 'cleaned' by removing redundant information on both the columns and rows, and I assume later you'll save it in some sort of a db, correct? Commented Apr 19, 2016 at 0:17
  • Yes that's correct. I'll probably save it as a tab-delimited text file first though, so that's the immediate output Commented Apr 19, 2016 at 0:28

1 Answer 1

1

You don't really want to load the input data into memory, because it's so large. Instead, a streaming approach will be faster, and for this awk is well suited:

#!/usr/bin/awk -f

BEGIN {
    FS = "\t";
    OFS = FS;
}

NR == 1 {
    # collect sample names                                                                                                                                                               
    for (i=1; i <= NF; i++) {
        sample[i] = $i
    }
}

NR == 2 {
    # first four columns are always the same                                                                                                                                             
    cols[1] = 1
    cols[2] = 3
    cols[3] = 4
    cols[4] = 5
    printf "%s %s %s %s ", sample[1], $3, $4, $5

    # dynamic columns (in practice: 2,6,10,...)                                                                                                                                          
    for (i=1; i <= NF; i++) {
        if ($i == "Beta_value") {
            cols[length(cols)+1] = i
            printf "%s ", sample[i]
        }
    }
    printf "\n"
}

NR >= 3 {
    # print cols from data row                                                                                                                                                           
    for (i=1; i <= length(cols); i++) {
        printf "%s ", $cols[i]
    }
    printf "\n"
}

This gives your desired output. If you want more speed, you might consider using awk simply to print the column numbers (which only requires reading the two header rows), then cut to actually print them. This will be faster because no interpreted code needs to run for each data row. For the sample data in the question, the cut command you need to print all the data rows is something like this:

cut -d '\t' -f 1,3,4,5,2,6
Sign up to request clarification or add additional context in comments.

12 Comments

Looks good but when I run the script and output the data, it isn't tab-formatted. What would allow me to do that in the code above? I can't say I understand much of it but I'll break it down into pieces and try to understand
@user3313178: If you want tabs, just change e.g. printf "%s " to printf "%s\t".
I'm finding that use of the code above is giving me a blank page. One possible reason is that a lot of the files I have are Hybridization REF instead of Hybridization_REF. Likewise, Composite Element REF instead of Composite_Element_REF. None of the other column headers differ. What may be the issue here?
So ignoring my comment above asically, when I now call this from a script I am getting: + NR == 1 '{' ../../../methods/meth_ma_gene/awk_methreformat.sh: line 3: NR: command not found ../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: syntax error near unexpected token `(' ../../../methods/meth_ma_gene/awk_methreformat.sh: line 5: ` for (i=1; i <= NF; i++) {'
@user3313178: Exactly how are you calling the above awk program from your script? It looks like you are running it as a shell script but it is an awk script.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.