Extracting Specific Columns from Multiple Files & Writing to File Python

Question

I have seven tab-delimited files. Each file has exactly the same number of columns with the same names, but different data. Below is a sample of what any of the seven files looks like:

 test_id gene_id gene    locus   sample_1        sample_2        status  value_1 value_2 log2(fold_change)
  000001     000001     ZZ 1:1   01  01   NOTEST  0       0       0       0       1       1       no

I am trying to read all seven files, extract the third, fourth and tenth columns (gene, locus, log2(fold_change)), and write those columns to a new file, so that the file looks something like this:

gene name   locus   log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)    log2(fold_change)
ZZ  1:1         0     0     0     0

Each log2(fold_change) value is obtained from the tenth column of one of the seven files.

This is what I have so far. I need help constructing a more efficient, Pythonic way to accomplish the task above. Note that the code does not yet accomplish the task and still needs some work.

 import csv
 import glob
 from collections import defaultdict

 # gene -> list of (column tag, full row) pairs, one pair per file containing the gene
 dicti = defaultdict(list)
 filetag = []

 def read_data(file, base):
     with open(file, 'r') as f:
         reader = csv.reader(f, delimiter='\t')
         for row in reader:
             # Skip the header row.
             if 'test_id' not in row[0]:
                 dicti[row[2]].append((base, row))

 name_of_fold = raw_input("Folder name to stored output files in: ")
 for file in glob.glob("*.txt"):
     base = file[0:3] + "-log2(fold_change)"
     filetag.append(base)
     read_data(file, base)

 with open("output.txt", "w") as out:
     out.write("gene name" + "\t" + "locus" + "\t" + "\t".join(sorted(filetag)) + "\n")
     for k, v in dicti.items():
         out.write(k + "\t" + v[1][1][3] + "\t" + "".join([ int(z[0][0:3]) * "\t" + z[1][9]  for z in v ]) + "\n")

The code above runs, but it does not produce what I am looking for. The problem is in the output step. I am writing a tab-delimited output file with the gene in the first column (k); v[1][1][3] is the locus of that particular gene. The part I am having a tough time coding is this piece of the output line:

 "".join([ int(z[0][0:3]) * "\t" + z[1][9]  for z in v ])

I am trying to collect the fold change for that particular gene and locus from each of the seven files and write each value to the correct column. To do that, I multiply "\t" by the number of the file the value came from, which should ensure the value lands in the right column. The problem is that when the value from the next file comes along, writing continues from where the previous value ended, which I don't want. I want the tab count to start again from the beginning of the line.

Here is what I mean, for instance:

 gene name   locus     log2(fold change) from file 1    .... log2(fold change) from file7 
 ZZ           1:3      0           
                             0

The first log2 value is placed according to its file number, for instance 2: I multiply "\t" by 2 and append the fold-change value, and it is recorded without a problem. But the last value belongs in the seventh column, for instance, and it does not end up there, because its tabs are counted from where the previous value was written rather than from the start of the line.
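
The drifting columns described above come from treating each value as "N tabs plus a value" and concatenating the pieces, so every piece starts where the previous one ended. A minimal sketch of one alternative, assuming each file contributes at most one value per gene and that files are numbered from 1: reserve one cell per file, fill the cells by index, and join once. The names `build_row` and `n_files` are hypothetical and do not come from the code in the question.

```python
def build_row(gene, locus, tagged_values, n_files):
    """Return one tab-delimited line with a fixed cell for every input file.

    tagged_values is a list of (file_number, value) pairs, with file numbers
    starting at 1. Files that lack the gene leave their cell empty.
    """
    cells = [""] * n_files              # one reserved cell per file
    for file_number, value in tagged_values:
        cells[file_number - 1] = value  # position comes from the index, not from tabs
    return "\t".join([gene, locus] + cells) + "\n"


# Values from files 2 and 7 only; the other five cells stay empty.
line = build_row("ZZ", "1:3", [(2, "0"), (7, "0.5")], n_files=7)
```

Because every row has the same number of cells, a value from file 7 always lands in the seventh fold-change column, however many earlier files were missing the gene.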

If the script isn't working correctly, could you add the error you're getting, or what specifically isn't working, to the question so that answers can focus on that? As for making it more Pythonic, that question is probably better asked on Code Review, but I'd wait until this code is functioning properly first, as they require working code. See the Code Review help pages for more information on asking questions there.
– Matthew Champion, Commented May 6, 2016 at 9:45
Hi @MattChampion, I added what you asked for. Please let me know if it makes sense. Thanks a lot.
– aBiologist, Commented May 6, 2016 at 10:02

Darius · Accepted Answer · 2016-05-21 23:48:03Z

Here is my first approach:

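import glob
import numpy as np

# Here you can change the pattern of the files (e.g. 'file_experiment_*.txt').
# output.txt is excluded so that the output is never read back as an input.
fns = sorted(fn for fn in glob.glob('*.txt') if fn != 'output.txt')

with open('output.txt', 'w') as out:
    # Title row:
    titles = ['gene_name', 'locus'] + [str(i + 1) + '_log2(fold_change)' for i in range(len(fns))]
    out.write('\t'.join(titles) + '\n')
    # Data rows: gene and locus come from the first file, plus one fold-change column per file.
    # This assumes every file lists the same genes in the same order.
    columns = []
    for idx, fn in enumerate(fns):
        # dtype=str replaces np.str, which newer NumPy releases no longer provide.
        # atleast_2d keeps a (rows, 3) shape even when a file has a single data row.
        table = np.atleast_2d(np.genfromtxt(fn, skip_header=1, usecols=(2, 3, 9), dtype=str, autostrip=True))
        if idx == 0:
            columns.extend([table[:, 0], table[:, 1]])
        columns.append(table[:, 2])
    for row in zip(*columns):
        out.write('\t'.join(row) + '\n')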

Content of the created file output.txt (Note: I created just three files for testing):

gene_name   locus   1_log2(fold_change) 2_log2(fold_change) 3_log2(fold_change)
ZZ  1:1 0   0   0
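
The approach above relies on every file listing the same genes in the same order. If that cannot be guaranteed, a dictionary keyed by gene and locus keeps each value in its own file's column. The following is only a sketch for Python 3, using the standard `csv` module rather than NumPy. `merge_fold_changes` and its parameters are hypothetical names, and the sketch assumes the inputs are truly tab-delimited with exactly one header line.

```python
import csv
from collections import OrderedDict


def merge_fold_changes(paths, out_path):
    """Write gene, locus and one log2(fold_change) column per input file."""
    merged = OrderedDict()  # (gene, locus) -> one cell per file, in first-seen order
    for idx, path in enumerate(paths):
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # skip the header line
            for row in reader:
                if len(row) < 10:
                    continue  # blank or truncated line
                gene, locus, fold = row[2].strip(), row[3].strip(), row[9].strip()
                cells = merged.setdefault((gene, locus), [""] * len(paths))
                cells[idx] = fold
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        header = ["gene_name", "locus"]
        header += ["%d_log2(fold_change)" % (i + 1) for i in range(len(paths))]
        writer.writerow(header)
        for (gene, locus), cells in merged.items():
            writer.writerow([gene, locus] + cells)
```

A call might look like `merge_fold_changes(sorted(glob.glob('file_experiment_*.txt')), 'output.txt')`, with a pattern chosen so that it does not match the output file. A gene that is missing from one file simply gets an empty cell in that file's column.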

Kaladin · Accepted Answer · 2016-05-19 15:39:03Z

I am using re instead of csv. The main problem with your code is the for loop that writes the output to the file. I am posting the complete code and hope it solves the problem you have.

import collections
import glob
import re

# gene -> list of (column tag, full row) pairs, in the order the files were read
dicti = collections.defaultdict(list)
filetag = []
# Compiled once: each match is a run of non-whitespace characters (one field).
field_re = re.compile(r'\S+')

def read_data(file, base):
  with open(file, 'r') as f:
    for line in f:
      row = field_re.findall(line)
      # Skip blank lines and the header row.
      if row and 'test_id' not in row[0]:
        dicti[row[2]].append((base, row))

def format_row(fields):
  # Left-align every field in a 30-character column.
  return ''.join('{:<30}'.format(field) for field in fields) + '\n'

def main():
  for file in glob.glob("*.txt"):
    base = file[0:3] + "-log2(fold_change)"
    filetag.append(base)
    read_data(file, base)

  with open("output", "w") as out:
    out.write(format_row(["genename", "locus"] + sorted(filetag)))
    for key in dicti:
      entries = dicti[key]
      # Locus from the first file that lists this gene, then one fold change per entry.
      out.write(format_row([key, entries[0][1][3]] + [z[1][9] for z in entries]))

if __name__ == '__main__':
  main()

I also changed the name of the output file from output.txt to output, because the former could interfere with the run: the code reads every .txt file in the folder. Below is the output I got, which I assume is the format you wanted. Thanks.

gene name   locus   1.t-log2(fold_change)   2.t-log2(fold_change)    3.t-log2(fold_change)  4.t-log2(fold_change)   5.t-log2(fold_change)   6.t-log2(fold_change)   7.t-log2(fold_change)
ZZ  1:1             0           0           0           0           0           0           0

edited May 19, 2016 at 15:39

answered May 19, 2016 at 6:40

5 Comments

aBiologist Over a year ago

Still the values are not matched with the columns corresponding to the file number. Some values go beyond the actual number of columns

Kaladin Over a year ago

Can you show me the output you're talking about when you say 'Some values go beyond the actual number of columns'?

aBiologist Over a year ago

Can I add a screenshot image here ?

aBiologist Over a year ago

gene name locus 002-log2(fold_change) 003-log2(fold_change) 005-log2(fold_change) 006-log2(fold_change) 007-log2(fold_change) 008-log2(fold_change) 009-log2(fold_change) 010-log2(fold_change) 011-log2(fold_change) 012-log2(fold_change) 013-log2(fold_change) 015-log2(fold_change) 016-log2(fold_change) 017-log2(fold_change) 018-log2(fold_change) RP11-433M22.1 chr17:46210801-46507637 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

aBiologist Over a year ago

The alignment is still a problem. When I open the output in a text editor, for instance, the columns are not aligned well, and it is not a tab-delimited file, which is the file format I was asking for.
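
On the alignment point: a tab-delimited file usually does not look aligned in a text editor, because a tab advances to the next tab stop rather than padding each field to a fixed width. What matters to a parser is that there is exactly one tab between fields. A small Python 3 sketch with the standard `csv` module, using values copied from the comment above (the variable names are only illustrative):

```python
import csv
import io

rows = [
    ["gene name", "locus", "002-log2(fold_change)"],
    ["RP11-433M22.1", "chr17:46210801-46507637", "0"],
]

# Write the rows with exactly one tab between fields.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()

# Reading the text back recovers the same fields, whatever it looks like on screen.
parsed = list(csv.reader(io.StringIO(text), delimiter="\t"))
```

Padding fields to a fixed width, by contrast, looks aligned on screen but is no longer tab-delimited.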

Alexander · Accepted Answer · 2016-05-23 22:03:13Z

Remember to append \n to the end of each line to create a line break. This method is very memory efficient, as it just processes one row at a time.

import csv
import os
import glob

# Your folder location where the input files are saved.
name_of_folder = '...'  
output_filename = 'output.txt'
input_files = glob.glob(os.path.join(name_of_folder, '*.txt'))

with open(os.path.join(name_of_folder, output_filename), 'w') as file_out:
    headers_read = False
    for input_file in input_files:
        if input_file == os.path.join(name_of_folder, output_filename):
            # If the output file is in the list of input files, ignore it.
            continue
        with open(input_file, 'r') as fin:
            reader = csv.reader(fin)
            if not headers_read:
                # Read column headers just once.
                headers = next(reader)[0].split()
                headers = headers[2:4] + [headers[9]]  # Zero-based indexing.
                file_out.write("\t".join(headers) + '\n')
                headers_read = True
            else:
                next(reader)  # Ignore header row.
            for line in reader:
                if line:  # Ignore blank lines.
                    line_out = line[0].split()
                    # Join first, then add the newline, so no trailing tab is written.
                    file_out.write("\t".join(line_out[2:4] + [line_out[9]]) + '\n')

>>> !cat output.txt
gene    locus   log2(fold_change)   
ZZ  1:1 0   
ZZ  1:1 0
