I have a list of IDs in one file and want to use them to grep the matching information from a second file. My output only shows the information for the last ID, and I can't figure out how to tweak my code so that it prints the info for every ID, not only the last one.

My command:

for i in $(cat my_ids.txt); 
do 
    for name in $i; 
    do 
        class=$(grep -A 25 $name id_info.txt | grep -E "tf_class"); 
        family=$(grep -A 25 $name id_info.txt | grep -E "tf_family"); 
        echo -e "$name\n\class\n\family"; 
   done
done

I only get the information lines for the last ID. I need them for each ID and I don't know how else to tweak this. I also tried removing the second for loop, but that gave exactly the same output.

Sample input from my_ids.txt:

MA0052.4
MA0602.1
MA0497.1
MA0786.1
MA0515.1

Sample input from id_info.txt:

AC MA0052.4
XX
ID MEF2A
XX
DE MA0052.4 MEF2A ; From JASPAR
PO  A   C   G   T
01  5075.0  2119.0  3651.0  5317.0
02  4033.0  1960.0  4493.0  5676.0
03  1984.0  10919.0 1007.0  2252.0
04  627.0   2974.0  236.0   12325.0
05  12437.0 1013.0  1066.0  1646.0
06  13132.0 253.0   610.0   2167.0
07  14680.0 141.0   506.0   835.0
08  14453.0 231.0   241.0   1237.0
09  14956.0 173.0   202.0   831.0
10  441.0   349.0   215.0   15157.0
11  15582.0 50.0    422.0   108.0
12  2566.0  1060.0  11104.0 1432.0
13  7709.0  4039.0  1605.0  2809.0
14  6171.0  3523.0  1810.0  4658.0
15  5254.0  3812.0  2479.0  4617.0
XX
CC tax_group:vertebrates
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
CC pubmed_ids:25217591
CC uniprot_ids:Q02078
CC data_type:ChIP-seq
AC MA0602.1
XX
ID Arid5a
XX
DE MA0602.1 Arid5a ; From JASPAR
PO  A   C   G   T
01  18.0    43.0    23.0    17.0
02  16.0    32.0    3.0 48.0
03  85.0    3.0 7.0 5.0
04  96.0    0.0 1.0 2.0
05  6.0 0.0 1.0 93.0
06  93.0    1.0 1.0 6.0
07  2.0 1.0 1.0 96.0
08  4.0 9.0 4.0 83.0
09  23.0    3.0 52.0    22.0
10  34.0    35.0    18.0    12.0
11  29.0    13.0    27.0    31.0
12  57.0    8.0 19.0    16.0
13  29.0    18.0    26.0    27.0
14  34.0    23.0    15.0    27.0
XX
CC tax_group:vertebrates
CC tf_family:ARID-related
CC tf_class:ARID
CC pubmed_ids:25215497
CC uniprot_ids:Q3U108
CC data_type:PBM
XX
AC MA0497.1
XX
ID MEF2C
XX
DE MA0497.1 MEF2C ; From JASPAR
PO  A   C   G   T
01  705.0   321.0   676.0   507.0
02  733.0   151.0   573.0   752.0
03  431.0   196.0   822.0   760.0
04  382.0   1412.0  78.0    337.0
05  0.0 985.0   0.0 1224.0
06  1616.0  256.0   74.0    263.0
07  1706.0  32.0    241.0   230.0
08  2107.0  0.0 87.0    15.0
09  2131.0  0.0 2.0 76.0
10  2135.0  0.0 4.0 70.0
11  56.0    62.0    0.0 2091.0
12  2177.0  0.0 32.0    0.0
13  389.0   120.0   1671.0  29.0
14  975.0   836.0   148.0   250.0
15  1009.0  450.0   126.0   624.0
XX
CC tax_group:vertebrates
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
CC pubmed_ids:7559475
CC uniprot_ids:Q06413
CC data_type:ChIP-seq
XX
AC MA0786.1
XX
ID POU3F1
XX
DE MA0786.1 POU3F1 ; From JASPAR
PO  A   C   G   T
01  1034.0  126.0   322.0   1437.0
02  505.0   186.0   128.0   2471.0
03  2471.0  7.0 26.0    21.0
04  44.0    53.0    21.0    2471.0
05  37.0    13.0    2471.0  232.0
06  170.0   2471.0  413.0   1119.0
07  1423.0  1.0 21.0    1048.0
08  2471.0  103.0   130.0   284.0
09  2471.0  20.0    25.0    63.0
10  259.0   95.0    128.0   2471.0
11  382.0   302.0   620.0   1167.0
12  1510.0  478.0   452.0   961.0
XX
CC tax_group:vertebrates
CC tf_family:POU domain factors
CC tf_class:Homeo domain factors
CC pubmed_ids:1361172
CC uniprot_ids:Q03052
CC data_type:HT-SELEX
XX
AC MA0515.1
XX
ID Sox6
XX
DE MA0515.1 Sox6 ; From JASPAR
PO  A   C   G   T
01  4.0 139.0   50.0    56.0
02  0.0 221.0   0.0 28.0
03  161.0   0.0 0.0 88.0
04  0.0 0.0 0.0 249.0
05  0.0 0.0 0.0 249.0
06  0.0 0.0 249.0   0.0
07  0.0 0.0 0.0 249.0
08  0.0 115.0   5.0 129.0
09  4.0 112.0   0.0 133.0
10  14.0    76.0    31.0    128.0
XX
CC tax_group:vertebrates
CC tf_family:SOX-related factors
CC tf_class:High-mobility group (HMG) domain factors
CC pubmed_ids:21985497
CC uniprot_ids:P40645
CC data_type:ChIP-seq
XX

Example of the output I get when I run this as a bash script:

MA0052.4
MA0602.1
MA0497.1
MA0786.1
MA0515.1        CC tf_class:High-mobility group (HMG) domain factors    CC tf_family:SOX-related factors

Desired output:

 MA0602.1    CC ARID    CC ARID-related
 MA0497.1    CC MADS box factors    CC Regulators of differentiation
 MA0786.1    CC Homeo domain factors    CC POU domain factors
 MA0515.1    CC tf_class:High-mobility group (HMG) domain factors    CC tf_family:SOX-related factors

Here is another snippet I tried, but its output gives me only the ID names and nothing more, probably because I am messing up the syntax somehow (I ran this in the terminal):

while IFS= read -r line; do class=$(grep -A 25 $line id_infoc.txt | grep -E "tf_class"); family=$(grep -A 25 $line id_info.txt | grep -E "tf_family"); echo -e "$line\n\class\n\family"; done < my_ids.txt  
  • mywiki.wooledge.org/BashFAQ/001 Commented Jan 22, 2023 at 21:16
  • Didn't work - I have tried many versions of while read IFS and my output only gives me the lines (the id names) and no information as output. I'll add the code I tried for this as well in terminal. Commented Jan 22, 2023 at 21:20

2 Answers

Try this script:

#! /usr/bin/env bash

while read -r id; do
    name="$id"
    class=$( grep -A 25 "$name" id_info.txt | grep -E "tf_class")
    family=$(grep -A 25 "$name" id_info.txt | grep -E "tf_family")
    echo -e "${name}\n${class}\n${family}"
done <"my_ids.txt"

2 Comments

Why not just read -r name?
You could do that. On my system, if I want to use the value written by the read command (here id) after the loop, I need to copy it into another variable.

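Ignoring style, the bug in your code is that you use \family and \class instead of $family and $class. With echo -e, \c suppresses all further output and \f is a form feed, so neither variable is ever printed.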
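
A minimal sketch of the corrected loop, keeping the two-grep approach from the question. The sample files are abbreviated stand-ins created in a temporary directory, and lookup_ids is a name made up for illustration. Three things here are additions rather than part of the original code: printf replaces echo -e, the pattern is prefixed with "AC " and matched with -F so the dot in the ID is literal, and -m 1 keeps only the first match (-A and -m are assumed to be available, as they are in GNU and BSD grep).

```shell
#!/usr/bin/env bash
# Hypothetical stand-in data: abbreviated versions of the two files.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' MA0602.1 MA0515.1 > "$workdir/my_ids.txt"
cat > "$workdir/id_info.txt" <<'EOF'
AC MA0602.1
XX
ID Arid5a
XX
CC tf_family:ARID-related
CC tf_class:ARID
XX
AC MA0515.1
XX
ID Sox6
XX
CC tf_family:SOX-related factors
CC tf_class:High-mobility group (HMG) domain factors
XX
EOF

# lookup_ids IDS_FILE INFO_FILE
# Prints the id, its tf_class line and its tf_family line, one per line.
lookup_ids() {
    local ids_file=$1 info_file=$2 name class family
    while IFS= read -r name; do
        # Start at the record header and take the first matching line after it.
        class=$(grep -F -A 25 "AC $name" "$info_file" | grep -m 1 'tf_class')
        family=$(grep -F -A 25 "AC $name" "$info_file" | grep -m 1 'tf_family')
        # Note $class and $family here, not \class and \family.
        printf '%s\n%s\n%s\n' "$name" "$class" "$family"
    done < "$ids_file"
}

lookup_ids "$workdir/my_ids.txt" "$workdir/id_info.txt"
```

Using printf with explicit %s placeholders means any backslashes in the data are printed as they are instead of being interpreted as escape sequences.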

Invoking grep twice per ID, as you do, rereads id_info.txt each time, which becomes inefficient if the file is large and there are many IDs to check.

A straightforward solution in awk that only needs to read each file once might be:

awk '
    function do_print () {
        if (name in ids)
            printf("%s\n%s\n%s\n",name,class,family)
        name=family=class=""
    }

    # read ids into an array
    NR==FNR { ids[$0]; next }

    # start of a section
    /^AC / { do_print(); name=$2; next }

    # other candidate values found
    /^CC tf_family:/ { family=$0; next }
    /^CC tf_class:/ { class=$0; next }

    # maybe print final section
    END { do_print() }
' my_ids.txt id_info.txt

To strip the tf_family: and tf_class: prefixes, the regex patterns can be replaced with sub() calls:

    sub(/^CC tf_family:/,"CC ") { family=$0; next }
    sub(/^CC tf_class:/,"CC ") { class=$0; next }
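
Put together, the sub() variant might look like the sketch below. It is an illustration under stated assumptions, not a drop-in replacement: the sample files are abbreviated stand-ins, summarize_ids is a made-up name, and the tab-separated one-line layout is a guess at the format shown in the question's desired output.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in data: abbreviated versions of the two files.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' MA0602.1 MA0515.1 > "$workdir/my_ids.txt"
cat > "$workdir/id_info.txt" <<'EOF'
AC MA0052.4
XX
ID MEF2A
XX
CC tf_family:Regulators of differentiation
CC tf_class:MADS box factors
XX
AC MA0602.1
XX
ID Arid5a
XX
CC tf_family:ARID-related
CC tf_class:ARID
XX
AC MA0515.1
XX
ID Sox6
XX
CC tf_family:SOX-related factors
CC tf_class:High-mobility group (HMG) domain factors
XX
EOF

# summarize_ids IDS_FILE INFO_FILE
# Prints "id<TAB>class<TAB>family" for every record whose id is listed.
summarize_ids() {
    awk '
        function do_print() {
            if (name in ids)
                printf("%s\t%s\t%s\n", name, class, family)
            name = family = class = ""
        }

        # first file: remember the wanted ids
        NR == FNR { ids[$0]; next }

        # a new record starts: flush the previous one
        /^AC / { do_print(); name = $2; next }

        # sub() returns 1 when it replaced the prefix, so it doubles as the pattern
        sub(/^CC tf_family:/, "CC ") { family = $0; next }
        sub(/^CC tf_class:/, "CC ")  { class = $0; next }

        # flush the last record
        END { do_print() }
    ' "$1" "$2"
}

summarize_ids "$workdir/my_ids.txt" "$workdir/id_info.txt"
```

Records come out in the order they appear in id_info.txt, not in the order of my_ids.txt, and the MA0052.4 record is skipped because it is not in the ID list.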
