Shred

I have a script that processes a large file (>500 MB) made up of many lines following this scheme:

odd lines: >BLA_BLA length_XX cov.XX
even lines: AGCAGCAGACTCAGACTACAGAT  # on even lines there's a DNA sequence

It recalculates the value after "cov." using parameters passed as arguments, replaces the old value with the new one, and computes the percentage of "G" and "C" in the DNA sequence on the even lines.

So the output looks like:

> BLA_BLA length_XX
> nucleotide_cov: XX
> DNA seq (the same as on the even lines)
> GC_CONT: XX
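
The GC_CONT value is the share of G and C among the counted bases. Here is a minimal sketch of that calculation for a single sequence, assuming uppercase A/C/G/T input. The function name `gc_percent` is hypothetical, and awk is used purely for illustration.

```shell
# gc_percent SEQUENCE   (hypothetical helper, not part of the original script)
# Prints the percentage of G and C among the A/C/G/T bases of SEQUENCE.
gc_percent() {
    awk -v seq="$1" 'BEGIN {
        gc = gsub(/[GC]/, "", seq)   # gsub returns how many characters it replaced
        at = gsub(/[AT]/, "", seq)   # G and C are already gone, so this counts A and T
        if (gc + at == 0) { print "NA"; exit }   # avoid dividing by zero
        printf "%.2f\n", 100 * gc / (gc + at)
    }'
}

# Example: 11 of the 23 bases are G or C
gc_percent AGCAGCAGACTCAGACTACAGAT   # prints 47.83
```

The value is rounded to two decimals after multiplying by 100.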

Here's the code (only the loop):

# Ratio used to rescale the coverage value
K=$(($READLENGHT - $KMER + 1))
Y=$(echo "scale=4; $K / $READLENGHT" | bc)

while read odd; do
    # Header line: keep the first four "_"-separated fields
    echo -n "${odd##}" | cut -d "_" -f 1,2,3,4 && printf "nucleotide_cov: "
    # The old coverage is whatever follows the last "_"
    echo "scale=4;${odd##*_} / $Y" | bc
    read even
    # Sequence line: print it, then count each base with its own sed | grep pipeline
    echo "${even##}" &&
    ACOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "A")
    GCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "G")
    CCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "C")
    TCOUNT=$(echo "${even##}" |  sed -e "s/./&\n /g" | grep -c "T")
    TOTALBASES=$(($ACOUNT+$GCOUNT+$CCOUNT+$TCOUNT))
    GCCONT=$(($GCOUNT+$CCOUNT))
    printf "GC_CONT: "
    echo "scale=2;$GCCONT / $TOTALBASES *100" | bc
done < "$1"
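
To make the coverage recalculation concrete, here is a minimal sketch of the arithmetic applied to each header. It assumes made-up values (an old coverage of 10.5, READLENGHT=100, KMER=31) that do not come from the real data. The function name `nucleotide_cov` is hypothetical, and awk stands in for bc, so the result is rounded rather than truncated to four decimals.

```shell
# nucleotide_cov OLD_COV READLENGHT KMER   (hypothetical helper)
# Divides the old coverage by Y = (READLENGHT - KMER + 1) / READLENGHT,
# the same ratio the script computes before its loop.
nucleotide_cov() {
    awk -v cov="$1" -v len="$2" -v k="$3" 'BEGIN {
        y = (len - k + 1) / len
        printf "%.4f\n", cov / y
    }'
}

# Example: Y = 70 / 100 = 0.7, so 10.5 / 0.7 = 15
nucleotide_cov 10.5 100 31   # prints 15.0000
```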

It is incredibly slow when run against a huge text file (bigger than 500 MB) on a 16-core server. Any ideas on how to speed up this script?
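
One plausible source of the slowness is that the loop starts several short-lived processes (cut, sed, grep, bc) for every pair of lines. As a sketch only, the same per-record work could be done in a single awk process. It assumes headers are "_"-separated with at least four fields and the coverage after the last "_", as the `cut` and `${odd##*_}` calls in the loop imply. The function name `recalc_cov` and its argument order are hypothetical.

```shell
# recalc_cov FILE READLENGHT KMER   (hypothetical helper)
recalc_cov() {
    awk -v len="$2" -v k="$3" '
        BEGIN { y = (len - k + 1) / len }        # same ratio as Y in the loop
        NR % 2 == 1 {                            # odd lines: headers
            n = split($0, f, "_")
            printf "%s_%s_%s_%s\n", f[1], f[2], f[3], f[4]
            printf "nucleotide_cov: %.4f\n", f[n] / y
            next
        }
        {                                        # even lines: DNA sequence
            print
            seq = $0
            gc = gsub(/[GC]/, "", seq)           # number of G and C characters
            at = gsub(/[AT]/, "", seq)           # number of A and T characters
            if (gc + at > 0) printf "GC_CONT: %.2f\n", 100 * gc / (gc + at)
        }
    ' "$1"
}
```

The output can differ slightly from the loop's. bc with scale=4 truncates where awk's `%.4f` rounds, and with `scale=2` bc truncates the G+C ratio to two decimals before multiplying by 100, so the loop prints whole percentages such as 47.00 where this sketch prints 47.83.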

EDIT

As requested, the desired I/O is provided via pastebin: https://pastebin.com/FY0Z7kUW

Increase speed of a Bash script that uses grep in a while loop
