How to compare two columns in two CSVs using a dictionary?

Question

I have two large CSV files and I want to compare column1 in csv1 with column1 in csv2. I was able to do this with Python lists: I read csv1 and put column1 into list1, do the same with csv2, and then check whether each element of list1 is present in list2:

olist = []
def oldList(self):
    for row in self.csvreaderOld:
        self.olist.append(row[1])

nlist = []
def newList(self):
    for row in self.csvreaderNew:
        self.nlist.append(row[1])

def new_list(self):
    return [item for item in self.olist if item not in self.nlist]

The code works but can take a long time to complete. I am trying to see whether a dictionary would be faster, so that I can check whether the keys in dictionary1 exist in dictionary2, but so far I haven't been successful owing to my limited knowledge.

Guy Van-Dam · Accepted Answer · 2023-01-14 09:18:00Z

1

If it's a big CSV file or you're planning to continue working with tables, I would suggest doing it with the Pandas module.

To be honest, even if it's a small file, or you're not going to continue working with tables, Pandas is an excellent module.

From what I understand (and I might be mistaken), for reading CSV files, Pandas is one of the quickest libraries to do so.

import pandas as pd

df = pd.read_csv("path to your csv file", usecols=["column1", "column2"])

def new_list(df):
    return [item for item in df["column2"].values if item not in df["column1"].values]

It's important to use .values when checking for an item in a pandas Series (extracting a column from a DataFrame gives you a pandas Series), because the in operator on a Series checks the index labels rather than the values.

You could also use list(df["column1"]) or the other methods suggested in "How to determine whether a Pandas Column contains a particular value" to check whether a value is contained in a pandas column.

For example:

df = pd.DataFrame({"column1":[1,2,3,4], "column2":[2,3,4,5]})

The data frame would be:

column1   column2
1   2
2   3
3   4
4   5

and new_list(df) would return [5].
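
As the comments on this answer point out, the question compares column1 across two files rather than two columns of one file. A minimal sketch of that variant, assuming both files have a header named column1. The file contents are inlined through io.StringIO so the snippet runs on its own, and the sample values are made up:

```python
import io
import pandas as pd

# Hypothetical file contents; in practice pass real paths to pd.read_csv.
old_csv = io.StringIO("column1,column2\na,1\nb,2\nc,3\n")
new_csv = io.StringIO("column1,column2\nb,9\nc,8\nd,7\n")

# Read only the column that is being compared.
df_old = pd.read_csv(old_csv, usecols=["column1"])
df_new = pd.read_csv(new_csv, usecols=["column1"])

# Series.isin is a vectorized membership test; ~ inverts the boolean mask,
# keeping the rows of the old file whose value is absent from the new file.
mask = ~df_old["column1"].isin(df_new["column1"])
missing = df_old.loc[mask, "column1"].tolist()
print(missing)  # ['a']
```

Unlike a set difference, this keeps the original row order and any duplicate values from the old file.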

edited Jan 14, 2023 at 9:18

answered May 24, 2021 at 12:59

Guy Van-Dam


2 Comments

Itzik Chaimov Over a year ago

Hi @'Guy vandam', you have to read both files, according to the question requirement.

Guy Van-Dam Over a year ago

Hey @ItzikChaimov, you're right, I got confused with the column and file numbering. The answer is similar: you can read two DataFrames, df1 and df2, and get column1 from each. Also, you should use usecols= instead of columns= when reading from a CSV in pandas. My bad.

Itzik Chaimov · Accepted Answer · 2021-05-24 12:35:30Z

0

You can read both files into objects and compare them in a single loop. Here is a short code snippet for the idea (not a class implementation):

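outList = []

# Open both files with a context manager so they are closed automatically.
with open('oldFile.csv', 'r') as fsOld, open('newFile.csv', 'r') as fsNew:
    fsLinesOld = fsOld.readlines()
    fsLinesNew = fsNew.readlines()

# Assumes both files have the same number of lines, in the same order;
# zip stops at the shorter file, so any extra lines are ignored.
for oldLine, newLine in zip(fsLinesOld, fsLinesNew):
    if oldLine == newLine:
        outList.append(oldLine)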
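
If the rows of the two files do not line up, a position-by-position comparison will miss matches. A sketch of an alternative using the standard csv module with a set for lookups, which is closer to what the question asks for. The helper names and sample data here are hypothetical, and column index 0 is assumed to be the column of interest:

```python
import csv
import io

def column_values(lines, index=0):
    """Collect the values of one column into a set (skipping empty rows)."""
    return {row[index] for row in csv.reader(lines) if row}

def missing_from_new(old_lines, new_lines, index=0):
    """Return values from the old file's column that the new file lacks."""
    new_values = column_values(new_lines, index)
    # Set membership is a hash lookup, so each check is O(1) on average
    # instead of scanning a whole list.
    return [row[index] for row in csv.reader(old_lines)
            if row and row[index] not in new_values]

# Made-up data standing in for open('oldFile.csv') and open('newFile.csv').
old_file = io.StringIO("a,1\nb,2\nc,3\n")
new_file = io.StringIO("c,30\na,10\n")
result = missing_from_new(old_file, new_file)
print(result)  # ['b']
```

A dict keyed by the column value would work the same way if the rest of the row is needed later; a set is enough when only membership matters.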

answered May 24, 2021 at 12:35

Itzik Chaimov


Pawan Jain · Accepted Answer · 2021-05-24 12:43:43Z

0

First, change how you read the CSV files: if you only need one column, name it in usecols, like this:

df = pd.read_csv("sample_file.csv", usecols=col_list)

Second, if you are not comparing row by row, you can use a set difference, like this:

set(df.col.to_list()).difference(set(df2.col.to_list()))
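
Put together, a small sketch of the set-difference approach with two frames. The column name col and the inline data are placeholders standing in for the two CSV files:

```python
import pandas as pd

# Placeholder frames; in practice these would come from pd.read_csv(..., usecols=["col"]).
df = pd.DataFrame({"col": ["a", "b", "c", "b"]})
df2 = pd.DataFrame({"col": ["b", "c", "d"]})

# set.difference accepts any iterable, so the second Series can be passed directly.
only_in_first = set(df["col"]).difference(df2["col"])
print(only_in_first)  # {'a'}
```

Note that a set drops duplicates and does not preserve row order, which is fine when only the distinct missing values are needed.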

answered May 24, 2021 at 12:43

Pawan Jain
