Python: Get Average values from multiple columns in multiple files

Question

I am trying to write a program which will take as input one or more files and summarize the average values coming from 2 columns in each file.

For example, I have 2 files:

File1:

ID    Feature    Total    Percent
1.2    ABC    300    75
1.4    CDE    129    68

File2:

ID    Feature   Total    Percent
1.2    ABC    289    34
1.4    CDE    56    94

I want to iterate over each file and convert the percent to a number using:

def ReadFile(File):
    LineCount = 0
    f = open(File)
    Header =  f.readline()
    Lines = f.readlines()
    for Line in Lines:
        Info = Line.strip("\n").split("\t")
        ID, Feature, Total, Percent= Info[0], Info[1], int(Info[2]), int(Info[3])
        Num = (Percent/100.0)*Total

I'm not sure what the best way is to store the output so that I have the ID, Feature, Total and Percent for each file. Ultimately, I would like to create an outfile that contains the average percent over all files, weighted by each file's Total. In the above example I would get:

ID    Feature    AveragePercent
1.2    ABC    54.9    # ((75/100.0)*300 + (34/100.0)*289) / (300+289) * 100
1.4    CDE    75.9    # ((68/100.0)*129 + (94/100.0)*56) / (129+56) * 100
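
The expected values above are a weighted average: each file's percent is weighted by that file's total. A quick check of the arithmetic, using only the numbers from the two example files (the function name `weighted_percent` is illustrative, not part of the original code):

```python
def weighted_percent(rows):
    """rows is a list of (total, percent) pairs for one ID/Feature."""
    weighted_hits = sum(total * percent / 100.0 for total, percent in rows)
    grand_total = sum(total for total, _ in rows)
    return weighted_hits / grand_total * 100

print(round(weighted_percent([(300, 75), (289, 34)]), 1))  # row 1.2 ABC
print(round(weighted_percent([(129, 68), (56, 94)]), 1))   # row 1.4 CDE
```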

ojy · Accepted Answer · 2014-08-21 18:12:15Z

3

The pandas module is the way to go. Assuming that your files are named '1.txt' and '2.txt', the following code stores all your input, output, and intermediate computations in a pandas DataFrame, df. The columns of interest are then written to the file 'out.txt'.

import pandas as pd

file_names = ['1.txt', '2.txt']

# Read every file once and concatenate them in a single call
df = pd.concat([pd.read_csv(f_name, sep='\t') for f_name in file_names],
               ignore_index=True)

# Percent stays on the 0-100 scale, so the ratio below is already a percentage
df['Absolute'] = df['Percent'] * df['Total']
df['Sum_Total'] = df.groupby('Feature')['Total'].transform('sum')
df['Sum_Absolute'] = df.groupby('Feature')['Absolute'].transform('sum')
df['AveragePercent'] = df['Sum_Absolute'] / df['Sum_Total']

df_out = df[['ID', 'Feature', 'AveragePercent']].drop_duplicates()

df_out.to_csv('out.txt', sep='\t', index=False)
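
An alternative sketch, not part of the original answer: aggregate once per (ID, Feature) pair with groupby(...).sum() instead of broadcasting the sums back onto every row with transform and then dropping duplicates. The io.StringIO objects stand in for the two example files so the snippet runs on its own. Grouping by both ID and Feature, reading ID as a string, and rounding to one decimal are assumptions about what the asker wants.

```python
import io
import pandas as pd

# Stand-ins for the two example files; real code would pass file paths instead.
file1 = io.StringIO("ID\tFeature\tTotal\tPercent\n1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n")
file2 = io.StringIO("ID\tFeature\tTotal\tPercent\n1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n")

df = pd.concat(
    [pd.read_csv(f, sep="\t", dtype={"ID": str}) for f in (file1, file2)],
    ignore_index=True,
)
df["Absolute"] = df["Percent"] * df["Total"]

# One aggregation pass: sum both columns per (ID, Feature) group
sums = df.groupby(["ID", "Feature"], as_index=False)[["Total", "Absolute"]].sum()
sums["AveragePercent"] = (sums["Absolute"] / sums["Total"]).round(1)

df_out = sums[["ID", "Feature", "AveragePercent"]]
print(df_out.to_csv(sep="\t", index=False))
```

The result has one row per group, so no drop_duplicates step is needed.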

edited Aug 21, 2014 at 18:12

answered Aug 21, 2014 at 18:06

ojy


6 Comments

Padraic Cunningham Over a year ago

just beat me to it! +1

user2165857 Over a year ago

@ojy - This sounds like a really great solution. Unfortunately I failed to mention that my actual files each contain over 1 million rows of data. When I try to run using the above solution it is exceedingly slow - been running for over an hour...

user3885927 Over a year ago

@user2165857, do you have an option to use a database instead of a file?

user2165857 Over a year ago

@user3885927 - I actually am writing to a database (I wrote file for simplicity), but that isn't the bottleneck step.

ojy Over a year ago

Which step in my code was slow? I just tried simulating 10 million records, and it took just a couple of seconds to run. Approximately how many distinct Feature values do you have? (I ran the simulation with just 5.)


GravityScore · Accepted Answer · 2014-08-21 17:50:00Z

You'll need to carry some data over from one file to the next as you read them. Say you have a list of file paths in a variable called files:

data = {}
for filepath in files:
  with open(filepath, "r") as f:
    f.readline()  # skip the header line
    for line in f:
      info = line.strip().split("\t")
      id, feature, total, percent = info[0], info[1], int(info[2]), int(info[3])
      if id in data:
        # "total" accumulates the weighted counts, "count" the raw totals
        data[id]["total"] += total * (percent / 100.0)
        data[id]["count"] += total
      else:
        data[id] = {"feature": feature, "total": total * (percent / 100.0), "count": total}

# Output
with open("outfile", "w") as out:
  out.write("ID\tFeature\tAveragePercentage\n")
  for id in data:
    entry = data[id]
    out.write(str(id) + "\t" + entry["feature"] + "\t" + str(round(entry["total"] / entry["count"] * 100, 1)) + "\n")
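
Because this approach reads one line at a time, memory grows only with the number of distinct IDs, which may help with the million-row files mentioned in the comments. A minimal sketch of the same idea wrapped in a function, assuming tab-separated input with a single header row and a nonzero total per group. The name `average_percent`, the (ID, Feature) key, and the temporary example files are illustrative choices, not part of the original answer.

```python
import csv
import os
import tempfile
from collections import defaultdict

def average_percent(paths):
    """Return {(ID, Feature): weighted average percent} across all paths."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [weighted hits, total]
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter="\t")
            next(reader, None)  # skip the header row
            for row in reader:
                if not row:
                    continue  # ignore blank lines
                id_, feature, total, percent = row[0], row[1], int(row[2]), int(row[3])
                entry = sums[(id_, feature)]
                entry[0] += total * percent / 100.0
                entry[1] += total
    return {key: round(hits / total * 100, 1) for key, (hits, total) in sums.items()}

# Usage with throwaway copies of the two example files
tmp = tempfile.mkdtemp()
examples = {
    "File1.txt": "ID\tFeature\tTotal\tPercent\n1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n",
    "File2.txt": "ID\tFeature\tTotal\tPercent\n1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n",
}
paths = []
for name, text in examples.items():
    path = os.path.join(tmp, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

result = average_percent(paths)
print(result)
```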

Ashwini Chaudhary · Accepted Answer · 2014-08-21 17:43:01Z

A dictionary will be perfect for this. (I've left the header handling part for you.)

import fileinput

data = {}
for line in fileinput.input(['file1', 'file2']):
    idx, ft, values = line.split(None, 2)
    key = idx, ft     #use ID, Feature tuple as a key.
    tot, per = map(int, values.split())
    if key not in data:
        data[key] = {'num': 0, 'den': 0}
    data[key]['num'] += (per/100.0) * tot
    data[key]['den'] += tot

Now data contains:

{('1.2', 'ABC'): {'num': 323.26, 'den': 589},
 ('1.4', 'CDE'): {'num': 140.36, 'den': 185}}

Now we can loop over this dict and calculate the desired result:

for (idx, ft), v in data.items():
    print(idx, ft, round(v['num']/v['den']*100, 1))

Output:

1.2 ABC 54.9
1.4 CDE 75.9
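
One possible way to fill in the header handling that the answer leaves to the reader is fileinput.isfirstline(), which is true for the first line of each input file. The sketch below assumes Python 3 and that every file starts with exactly one header row. The temporary files exist only so the example runs on its own.

```python
import fileinput
import os
import tempfile

# Throwaway copies of the example files so the sketch is self-contained
tmp = tempfile.mkdtemp()
paths = []
for name, body in [("file1", "1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n"),
                   ("file2", "1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n")]:
    path = os.path.join(tmp, name)
    with open(path, "w") as f:
        f.write("ID\tFeature\tTotal\tPercent\n" + body)
    paths.append(path)

data = {}
with fileinput.input(files=paths) as lines:
    for line in lines:
        if fileinput.isfirstline():
            continue  # header row of each file
        idx, ft, values = line.split(None, 2)
        tot, per = map(int, values.split())
        entry = data.setdefault((idx, ft), {'num': 0, 'den': 0})
        entry['num'] += (per / 100.0) * tot
        entry['den'] += tot

for (idx, ft), v in data.items():
    print(idx, ft, round(v['num'] / v['den'] * 100, 1))
```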

Antoine Dahan · Accepted Answer · 2014-08-21 18:33:15Z

I have tested this using files with ID, Feature, Total, and Percent delimited by tabs (like your input file), and it works, giving the output you want:

globalResultsFromReadDictionary = {}

def ReadFile(File):
    with open(File) as f:  #closes the file once it has been read
        Header = f.readline()  #skip the header line
        for Line in f:
            Info = Line.strip("\n").split("\t")
            ID, Feature, Total, Percent = Info[0], Info[1], int(Info[2]), int(Info[3])

            #Adding to dictionary
            key = ID + "\t" + Feature
            if key in globalResultsFromReadDictionary:
                globalResultsFromReadDictionary[key].append([Total, Percent])
            else:
                globalResultsFromReadDictionary[key] = [[Total, Percent]]

def createFinalReport(File):
    overallReportFile = open(File, 'w') #the file to write the report to

    overallReportFile.write('ID\tFeature\tAvg%\n') #writing the header

    for idFeatureCombinationKey in globalResultsFromReadDictionary:

        #Tallying up the total and sum of percent*total for each element of the Id-Feature combination
        sumOfTotals = 0
        sumOfPortionOfTotals = 0
        for totalPercentCombination in globalResultsFromReadDictionary[idFeatureCombinationKey]:
            sumOfTotals += totalPercentCombination[0]
            sumOfPortionOfTotals += (totalPercentCombination[0]*(totalPercentCombination[1]/100.0)) #100.0 avoids integer division under Python 2

        #Write to the line (idFeatureCombinationKey is 'ID \t Feature', so can just write that)
        overallReportFile.write(idFeatureCombinationKey + '\t' + str(round((sumOfPortionOfTotals/sumOfTotals)*100, 1)) + '\n')

    overallReportFile.close()

#Calling the functions
ReadFile('File1.txt')
ReadFile('File2.txt')
createFinalReport('dd.txt')
