0

I am trying to write a program that takes one or more files as input and summarizes the average values from two columns in each file.

For example, I have 2 files:

File1:

ID    Feature    Total    Percent
1.2    ABC    300    75
1.4    CDE    129    68

File2:

ID    Feature   Total    Percent
1.2    ABC    289    34
1.4    CDE    56    94

I want to iterate over each file and convert the percent to an absolute number using:

def ReadFile(File):
    LineCount = 0
    f = open(File)
    Header =  f.readline()
    Lines = f.readlines()
    for Line in Lines:
        Info = Line.strip("\n").split("\t")
        ID, Feature, Total, Percent= Info[0], Info[1], int(Info[2]), int(Info[3])
        Num = (Percent/100.0)*Total

I'm not sure what the best way is to store the output so that I have the ID, Feature, Total and Percent for each file. Ultimately, I would like to create an outfile that contains the average percent over all files, weighted by Total. In the above example I would get:

ID    Feature    AveragePercent
1.2    ABC    54.9    # ((75/100.0)*300 + (34/100.0)*289) / (300+289) * 100
1.4    CDE    75.9    # ((68/100.0)*129 + (94/100.0)*56) / (129+56) * 100
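
As a quick check of the arithmetic above, here is a minimal sketch of the Total-weighted average. The helper name weighted_percent is hypothetical and not part of my program; the numbers are the ones from File1 and File2.

```python
def weighted_percent(rows):
    """Return the Total-weighted average percent for (total, percent) pairs."""
    # Absolute count behind each percentage, e.g. 75% of 300 is 225.
    hits = sum(total * percent / 100.0 for total, percent in rows)
    grand_total = sum(total for total, _ in rows)
    return 100.0 * hits / grand_total

abc = weighted_percent([(300, 75), (289, 34)])
cde = weighted_percent([(129, 68), (56, 94)])
print(round(abc, 1), round(cde, 1))  # prints: 54.9 75.9
```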

4 Answers

3

The pandas module is the way to go. Assuming that your files are named '1.txt' and '2.txt', the following code stores all your input, output, and intermediate computations in the pandas DataFrame instance df. Additionally, the information of interest is written to the file 'out.txt'.

import pandas as pd

file_names = ['1.txt', '2.txt']

# Read every file once and concatenate them in a single call.
df = pd.concat([pd.read_csv(f_name, sep='\t') for f_name in file_names])

# Percent * Total is the absolute count scaled by 100,
# so the ratio of the two sums below is already a percentage.
df['Absolute'] = df['Percent'] * df['Total']
df['Sum_Total'] = df.groupby('Feature')['Total'].transform('sum')
df['Sum_Absolute'] = df.groupby('Feature')['Absolute'].transform('sum')
df['AveragePercent'] = df['Sum_Absolute'] / df['Sum_Total']

df_out = df[['ID', 'Feature', 'AveragePercent']].drop_duplicates()

df_out.to_csv('out.txt', sep='\t', index=False)
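
For large inputs, it may help to aggregate once per group instead of broadcasting the sums back onto every row and de-duplicating afterwards. The following is only a sketch under assumptions: the function name average_percent is hypothetical, the sample data is inlined through io.StringIO so the example is self-contained, and ID is read as a string so that it is written back unchanged. I have not benchmarked it against the code above.

```python
import io
import pandas as pd

def average_percent(frames):
    """Total-weighted average Percent per (ID, Feature) over several DataFrames."""
    df = pd.concat(frames, ignore_index=True)
    df["Absolute"] = df["Percent"] * df["Total"]
    # One pass over the data: sum both columns per group.
    sums = df.groupby(["ID", "Feature"], as_index=False)[["Absolute", "Total"]].sum()
    sums["AveragePercent"] = (sums["Absolute"] / sums["Total"]).round(1)
    return sums[["ID", "Feature", "AveragePercent"]]

# Inline copies of the two sample files from the question.
file1 = "ID\tFeature\tTotal\tPercent\n1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n"
file2 = "ID\tFeature\tTotal\tPercent\n1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n"

frames = [pd.read_csv(io.StringIO(text), sep="\t", dtype={"ID": str})
          for text in (file1, file2)]
result = average_percent(frames)
print(result.to_csv(sep="\t", index=False))
```

With real files, the io.StringIO(text) argument would be replaced by the file path.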

6 Comments

just beat me to it! +1
@ojy - This sounds like a really great solution. Unfortunately I failed to mention that my actual files each contain over 1 million rows of data. When I try to run using the above solution it is exceedingly slow - been running for over an hour...
@user2165857, do you have an option to use a database instead of a file?
@user3885927 - I actually am writing to a database (I wrote file for simplicity), but that isn't the bottleneck step.
Which step in my code was slow? I just tried to simulate 10 million records, and it took just a couple of seconds to run. Approximately how many distinct Feature values do you have? (I ran the simulation with just 5.)
1

You'll need to keep some data around while reading the files. Say you have a list of file paths in a variable called files:

data = {}
for filepath in files:
  with open(filepath, "r") as f:
    f.readline()  # skip the header row
    for line in f:
      info = line.strip().split("\t")
      id, feature, total, percent = info[0], info[1], int(info[2]), int(info[3])
      if id in data:
        # "total" holds the running absolute count, "count" the running Total
        data[id]["total"] += total * (percent / 100.0)
        data[id]["count"] += total
      else:
        data[id] = {"feature": feature, "total": total * (percent / 100.0), "count": total}

# Output
with open("outfile", "w") as out:
  out.write("ID\tFeature\tAveragePercentage\n")
  for id in data:
    entry = data[id]
    # multiply by 100 to turn the fraction back into a percentage
    out.write(str(id) + "\t" + entry["feature"] + "\t" + str(100 * entry["total"] / entry["count"]) + "\n")


1

A dictionary will be perfect for this. (I've left the header handling part for you.)

import fileinput

data = {}
for line in fileinput.input(['file1', 'file2']):
    idx, ft, values = line.split(None, 2)
    key = idx, ft     #use ID, Feature tuple as a key.
    tot, per = map(int, values.split())
    if key not in data:
        data[key] = {'num': 0, 'den': 0}
    data[key]['num'] += (per/100.0) * tot
    data[key]['den'] += tot

Now data contains:

{('1.2', 'ABC'): {'num': 323.26, 'den': 589},
 ('1.4', 'CDE'): {'num': 140.36, 'den': 185}}

Now we can loop over this dict and calculate the desired result:

for (idx, ft), v in data.items():
    print(idx, ft, round(v['num']/v['den']*100, 1))

Output:

1.2 ABC 54.9
1.4 CDE 75.9
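
The header handling left out above could be sketched as follows. This is one possible approach, not the only one: it relies on fileinput's isfirstline() to skip the first line of every file, and it assumes, like the code above, that the Feature value contains no whitespace. The function name summarize and the temporary sample files are hypothetical and exist only to make the example self-contained.

```python
import fileinput
import os
import tempfile
from collections import defaultdict

def summarize(paths):
    """Return {(ID, Feature): weighted average percent} over all files in paths."""
    data = defaultdict(lambda: {"num": 0.0, "den": 0})
    with fileinput.input(files=paths) as lines:
        for line in lines:
            # isfirstline() is true for the first line of each file: the header.
            if lines.isfirstline() or not line.strip():
                continue
            idx, ft, tot, per = line.split()
            tot, per = int(tot), int(per)
            data[idx, ft]["num"] += (per / 100.0) * tot
            data[idx, ft]["den"] += tot
    return {key: round(v["num"] / v["den"] * 100, 1) for key, v in data.items()}

# Write the two sample files from the question to a temporary directory.
samples = {
    "file1": "ID\tFeature\tTotal\tPercent\n1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n",
    "file2": "ID\tFeature\tTotal\tPercent\n1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n",
}
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in samples.items():
    path = os.path.join(tmpdir, name)
    with open(path, "w") as handle:
        handle.write(text)
    paths.append(path)

result = summarize(paths)
for (idx, ft), pct in sorted(result.items()):
    print(idx, ft, pct)
```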


1

I have tested this using files with ID, Feature, Total and Percent delimited by tabs (like your input file), and it works, giving the output you want:

globalResultsFromReadDictionary = {}

def ReadFile(File):
    # "with" closes the file even if a line fails to parse
    with open(File) as f:
        f.readline()  # skip the header row
        for Line in f:
            Info = Line.strip("\n").split("\t")
            ID, Feature, Total, Percent = Info[0], Info[1], int(Info[2]), int(Info[3])

            #Adding to dictionary
            key = ID + "\t" + Feature
            if key in globalResultsFromReadDictionary:
                globalResultsFromReadDictionary[key].append([Total, Percent])
            else:
                globalResultsFromReadDictionary[key] = [[Total, Percent]]

def createFinalReport(File):
    overallReportFile = open(File, 'w') #the file to write the report to

    overallReportFile.write('ID\tFeature\tAvg%\n') #writing the header

    for idFeatureCombinationKey in globalResultsFromReadDictionary:

        #Tallying up the total and sum of percent*total for each element of the Id-Feature combination
        sumOfTotals = 0
        sumOfPortionOfTotals = 0
        for totalPercentCombination in globalResultsFromReadDictionary[idFeatureCombinationKey]:
            sumOfTotals += totalPercentCombination[0]
            sumOfPortionOfTotals += (totalPercentCombination[0]*(totalPercentCombination[1]/100.0)) #100.0 avoids integer division in Python 2

        #Write to the line (idFeatureCombinationKey is 'ID \t Feature', so can just write that)
        overallReportFile.write(idFeatureCombinationKey + '\t' + str(round((sumOfPortionOfTotals/sumOfTotals)*100, 1)) + '\n')

    overallReportFile.close()

#Calling the functions
ReadFile('File1.txt')
ReadFile('File2.txt')
createFinalReport('dd.txt')
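
With very large inputs, a variant that keeps only two running sums per ID-Feature key, instead of a list of every [Total, Percent] pair, may use less memory. The sketch below is an illustration under assumptions rather than a drop-in replacement: the names accumulate and write_report are hypothetical, the csv module handles the tab-separated parsing, and in-memory io.StringIO objects stand in for the real files.

```python
import csv
import io

def accumulate(handle, sums):
    """Add one tab-separated file's rows to sums: {(ID, Feature): [portion, total]}."""
    for row in csv.DictReader(handle, delimiter="\t"):
        key = (row["ID"], row["Feature"])
        total = int(row["Total"])
        entry = sums.setdefault(key, [0.0, 0])
        entry[0] += total * int(row["Percent"]) / 100.0
        entry[1] += total

def write_report(sums, handle):
    """Write ID, Feature and the weighted average percent, one key per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "Feature", "Avg%"])
    for (id_, feature), (portion, total) in sums.items():
        writer.writerow([id_, feature, round(100.0 * portion / total, 1)])

# Inline copies of the two sample files from the question.
file1 = "ID\tFeature\tTotal\tPercent\n1.2\tABC\t300\t75\n1.4\tCDE\t129\t68\n"
file2 = "ID\tFeature\tTotal\tPercent\n1.2\tABC\t289\t34\n1.4\tCDE\t56\t94\n"

sums = {}
for text in (file1, file2):
    accumulate(io.StringIO(text), sums)

out = io.StringIO()
write_report(sums, out)
report = out.getvalue()
print(report)
```

For real files, pass handles opened with open(path, newline='') instead of the io.StringIO objects.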
