
I'm trying to merge a single data column from 40 nearly identical CSV files with Pandas. The files contain information about Windows processes, in CSV form, generated by the Windows 'Tasklist' command.

What I want to do is merge the memory information from these files into a single file, using the PID as the key. However, some insignificant processes appear and disappear at random, which makes the files inconsistent: one file might have 65 rows and another 75. Those random processes don't matter, their changing PIDs don't matter, and they should simply be dropped when the files are merged.

This is how I first tried to do it:

# CSV files have following columns
# Image Name, PID, Session Name, Session #, Mem Usage

import pandas as pd

file1 = pd.read_csv("tasklist1.txt")
file1 = file1.drop(file1.columns[[2,3]], axis=1)

for i in range(2,41):

    filename = "tasklist" + str(i) + ".txt"

    filei = pd.read_csv(filename)
    filei = filei.drop(filei.columns[[0,2,3]], axis=1)

    file1 = file1.merge(filei, on='PID')


file1.to_csv("Final.txt", index=False)

From the first CSV file I drop only the Session Name and Session # columns, keeping the Image Name as a label for each row. From each of the following CSV files I keep just the PID and Mem Usage columns and merge it into the ever-growing result frame.

The problem is that when the loop reaches the 5th iteration, it cannot merge any more: I get the "Reindexing only valid with uniquely valued Index objects" error.

So I can merge the 1st file with the 2nd through 4th inside the first loop. If I then run a second loop that merges the 5th file with the 6th through 8th, and finally merge those two partial results together, all the data from files 1 to 8 merges perfectly fine.
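A plausible explanation for the 5th-iteration failure (a sketch with toy frames standing in for the real tasklist data): pandas' default merge suffixes are always _x/_y, so repeated merges can eventually try to recreate a column name such as Mem Usage_x that already exists. Running just the first two merges is enough to show the collision being set up:

```python
import pandas as pd

# Toy stand-in for one tasklist frame: just PID and Mem Usage.
def tasklist_frame():
    return pd.DataFrame({"PID": [100, 200], "Mem Usage": ["10 K", "20 K"]})

merged = tasklist_frame()
for _ in range(2):
    merged = merged.merge(tasklist_frame(), on="PID")

# First merge suffixes the clash: Mem Usage_x, Mem Usage_y.
# Second merge sees no clash (only a plain 'Mem Usage' on the right),
# so a bare 'Mem Usage' column joins them. A later merge that clashes
# again would try to create 'Mem Usage_x' a second time.
print(sorted(merged.columns))
# → ['Mem Usage', 'Mem Usage_x', 'Mem Usage_y', 'PID']
```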

Any suggestion on how to perform this kind of chained merge without creating some number of additional loops? At this point I'm experimenting with 40 files and could brute-force the whole process with nested loops, but that isn't an effective way of merging in the first place, and it's unacceptable if I need to scale this up to even more files.
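For what it's worth, one way to keep a single pass over all the files (a sketch with made-up miniature frames; renaming each value column up front is my own workaround, not something from the thread) is to give every memory column a unique name before merging, so no merge ever produces a name clash:

```python
import functools
import pandas as pd

# Miniature stand-ins for the 40 tasklist CSVs (hypothetical data).
frames = []
for i in range(1, 6):
    df = pd.DataFrame({"PID": [100, 200], "Mem Usage": [f"{i}0 K", f"{i}5 K"]})
    # Rename the value column up front so no merge ever sees a clash.
    frames.append(df.rename(columns={"Mem Usage": f"Mem Usage {i}"}))

# Chain all the merges in one expression; the default inner join on PID
# drops any process that is missing from at least one file.
merged = functools.reduce(lambda left, right: left.merge(right, on="PID"), frames)
print(list(merged.columns))
# → ['PID', 'Mem Usage 1', 'Mem Usage 2', 'Mem Usage 3', 'Mem Usage 4', 'Mem Usage 5']
```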

  • Duplicate column names will cause this error, try eliminating duplicate column names. Commented Jan 13, 2016 at 13:14
  • Thanks! Adding suffixes=('_x'+str(i), '_y'+str(i)) seems to do the trick and easily eliminates the duplicate column names. I never thought that similar column names could raise that unique-index error as well. Commented Jan 13, 2016 at 13:38
  • Can I create answer from comment? Do you accept it? Commented Jan 13, 2016 at 13:39

1 Answer


Duplicate column names will cause this error.

So you can add the suffixes parameter to merge:

suffixes : 2-length sequence (tuple, list, ...)

    Suffix to apply to overlapping column names in the left and right side, respectively.
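Applied to the loop from the question, per-iteration suffixes (the fix the asker confirmed in the comments above) keep every generated column name unique. The sketch below uses small made-up frames in place of the real CSV files:

```python
import pandas as pd

# Hypothetical miniature tasklist frames in place of pd.read_csv(...).
frames = [
    pd.DataFrame({"PID": [100, 200], "Mem Usage": [f"{i}0 K", f"{i}5 K"]})
    for i in range(1, 6)
]

merged = frames[0]
for i, filei in enumerate(frames[1:], start=2):
    # Unique suffixes per iteration, as in the comment, so clashing
    # 'Mem Usage' columns never produce the same name twice.
    merged = merged.merge(filei, on="PID", suffixes=("_x" + str(i), "_y" + str(i)))

# All column names are now distinct, so every merge succeeds.
assert merged.columns.is_unique
```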
