0

I have two large csv files and I want to compare column1 in csv1 with column1 in csv2. I was able to do this using Python List where I read csv1 and throw column1 in list1, do the same thing with csv2 and then check to see if element in list1 is present in list2

olist = []
def oldList(self):
    for row in self.csvreaderOld:
        self.olist.append(row[1])

nlist = []
def newList(self):
    for row in self.csvreaderNew:
        self.nlist.append(row[1])

def new_list(self):
    return [item for item in self.olist if item not in self.nlist]

the code works but can a long time to complete. I am trying to see if I can use dictionary instead, see if that would be faster, so I can compare keys in dictionary1 exist in dictionary2 but so far havent been successfully owing to my limited knowledge.

3 Answers 3

1

If it's a big CSV file or your'e planning to continue working with tables, I would suggest doing it with the Pandas module.

To be honest, even if it's a small file, or you're not going to continue working with tables, Pandas is an excellent module.

From what I understand (and I might be mistaken), for reading CSV files, Pandas is one of the quickest libraries to do so.

import pandas as pd

df = pd.read_csv("path to your csv file", use_cols = ["column1", "column2"])

def new_list(df):
    return [item for item in df["column2"].values if item not in df["column1"].values]

It's important to use .values when checking for an item in a pandas series (when you're extracting a column in a DataFrame you're getting a pandas series)

You could also use list(df["column1"]) and the other methods suggested in How to determine whether a Pandas Column contains a particular value for determining whether a value is contains in a pandas column

for example :

df = pd.DataFrame({"column1":[1,2,3,4], "column2":[2,3,4,5]})

the data frame would be

column1   column2
1   2
2   3
3   4
4   5

and new_line would return [5]

Sign up to request clarification or add additional context in comments.

2 Comments

Hi @'Guy vandam', you have to read both files, according to the question requirement.
Hey @ItzikChaimov, You're right, I got confused with the column and file numbering. It's a similar answer, you can read 2 dataframes, df1 and df2 and get column1 from each. also, you should use usecols= instead of columns= when reading from a csv in pandas. My bad..
0

You can read both files into objects and compare in a single loop. Here is a short code snippet for the idea (not class implementation):

fsOld = open('oldFile.csv', 'r')
fsNew = open('newFile.csv', 'r')
fsLinesOld = fsOld.readlines()
fsLinesNew = fsNew.readlines()
outList = []

# assumes lines are same for both files data:
for i in range(0, fsLinesOld.__len__(), 1):
    if ( fsLinesOld[i] == fsLinesNew[i]):
        outList.append(fsLinesOld[i])

Comments

0

First of all, change the way of reading the CSV files, if you want just one column mention that in usecols, like this

df = pd.read_csv("sample_file.csv", usecols=col_list)

And second, you can use set difference if you are not comparing row to row, like this

set(df.col.to_list()).difference(set(df2.col.to_list()))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.