I have two different .csv files, but they have the same id column.

file_1.csv:
id, column1, column2
4543DFGD_werwe_23, string
4546476FGH34_wee_24, string
....
45sd234_w32rwe_2342342, string

The other one:

file_1.csv:
id, column3, column4
4543DFGD_werwe_23, bla bla bla
4546476FGH34_wee_24, bla bla bla
....
45sd234_w32rwe_2342342, bla bla bla

How can I verify that these two id columns match (that is, contain the same ids), using either the csv module or pandas?

2 Answers

3

After loading you can call equals on the id column:

df['id'].equals(df1['id'])

This returns True only if the two columns are exactly the same, i.e. the same length and the same values in the same order; otherwise it returns False.

In [3]:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id':np.arange(10)})
df1 = pd.DataFrame({'id':np.arange(10)})
df.id.equals(df1.id)
Out[3]:
True

In [7]:

df = pd.DataFrame({'id':np.arange(10)})
df1 = pd.DataFrame({'id':[0,1,1,3,4,5,6,7,8,9]})
df.id.equals(df1.id)
Out[7]:
False

In [8]:

df.id == df1.id
Out[8]:
0     True
1     True
2    False
3     True
4     True
5     True
6     True
7     True
8     True
9     True
Name: id, dtype: bool
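
To see which rows differ rather than reading through the whole boolean column, the mask can be used to select only the mismatches. This is a minimal sketch reusing the sample data above; it assumes both frames have the same length and index (pandas raises an error when comparing differently labelled Series), and the names mismatch and diff are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': np.arange(10)})
df1 = pd.DataFrame({'id': [0, 1, 1, 3, 4, 5, 6, 7, 8, 9]})

# Boolean mask that is True where the two id columns disagree.
mismatch = df['id'] != df1['id']

# Put the differing values side by side, keeping the original row labels.
diff = pd.DataFrame({'file_1': df.loc[mismatch, 'id'],
                     'file_2': df1.loc[mismatch, 'id']})
print(diff)
```

With this sample data only row 2 is selected, which matches the single False in the output above.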

To load the csvs:

df = pd.read_csv('file_1.csv')
df1 = pd.read_csv('file_2.csv') # I'm assuming your real other csv is not the same name as file_1.csv

Then you can perform the same comparison as above:

df.id.equals(df1.id)

If you only want to compare the id columns, you can load just that column with usecols:

df = pd.read_csv('file_1.csv', usecols=['id'])
df1 = pd.read_csv('file_2.csv', usecols=['id'])
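
Note that equals is order-sensitive. If the two files might list the same ids in a different order, one option is to compare the ids as sets. The following is a sketch only: the in-memory CSV text stands in for the real files, and the ids and variable names are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-ins for the two csv files (hypothetical contents).
csv_a = io.StringIO("id,column1\nA_1,x\nB_2,y\nC_3,z\n")
csv_b = io.StringIO("id,column3\nC_3,p\nA_1,q\nD_4,r\n")

ids_a = pd.read_csv(csv_a, usecols=['id'])['id']
ids_b = pd.read_csv(csv_b, usecols=['id'])['id']

same_order = ids_a.equals(ids_b)       # strict: same values in the same order
same_ids = set(ids_a) == set(ids_b)    # ignores order and duplicate ids

# Ids that appear in one file but not the other.
only_in_a = sorted(set(ids_a) - set(ids_b))
only_in_b = sorted(set(ids_b) - set(ids_a))

print(same_order, same_ids, only_in_a, only_in_b)
```

The set comparison discards duplicates, so it answers "do both files contain the same ids" rather than "are the columns identical".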

14 Comments

Wow, nice. Thanks for the help. How can I change 'id':np.arange(10) to cover the length of a large file?
You're a little confused: my code shows sample data. I will update the answer to show how to load the csv in pandas and perform the same comparison.
I can tell you that the pandas csv reader is lightning fast at loading csv files, faster than the Python standard csv module; see the link: wesmckinney.com/blog/…
@ml_guy No, ignore the np.arange portion. Straight after loading the csvs, just do df.id.equals(df1.id); there is no need to construct new dfs.
pd.set_option('display.max_rows', None)
1

Using the csv module:

  1. Open both files.
  2. Read each file with csv.reader().
  3. Build a dictionary for each file, using the first item of each row (the id) as the key and the whole row as the value.
  4. Use set intersection to get the keys that appear in both dictionaries.
  5. Print the result.

code:

import csv

file1 = '/home/vivek/Desktop/stackoverflow/fil1.csv'
file2 = '/home/vivek/Desktop/stackoverflow/fil2.csv'


def read_rows(path):
    """Map the first field of each row (the id) to the full row."""
    rows = {}
    with open(path, newline='') as fp:
        for row in csv.reader(fp):
            if row:  # skip blank lines, which csv.reader yields as []
                rows[row[0]] = row
    # Drop the header row, which is keyed by the literal string "id".
    rows.pop("id", None)
    return rows


rows1 = read_rows(file1)
rows2 = read_rows(file2)

# Ids that appear in both files.
result = set(rows1).intersection(rows2)

print("Same Id :", list(result))

output:

vivek@vivek:~/Desktop/stackoverflow$ python 27.py
Same Id : ['4546476FGH34_wee_24', '4543DFGD_werwe_23', '45sd234_w32rwe_2342342']
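
The intersection only shows the ids the files share. To check whether the id columns fully match, it can help to also list the ids that are missing from either side. The sketch below is one way to do that with the csv module; the in-memory file contents are hypothetical (they reuse the sample ids from the question in a made-up arrangement), and the helper name read_ids is illustrative. It also strips whitespace around each id, since stray spaces make otherwise identical ids compare as different and are one possible cause of an unexpectedly empty result.

```python
import csv
import io

# Hypothetical stand-ins for the two files; note the stray space before
# the first id in file_b.
file_a = io.StringIO(
    "id, column1\n4543DFGD_werwe_23, string\n4546476FGH34_wee_24, string\n")
file_b = io.StringIO(
    "id, column3\n 4543DFGD_werwe_23, bla\n45sd234_w32rwe_2342342, bla\n")


def read_ids(fp):
    """Return the set of ids (first field of each row), skipping the header."""
    reader = csv.reader(fp)
    next(reader, None)  # skip the header row
    return {row[0].strip() for row in reader if row}


ids_a = read_ids(file_a)
ids_b = read_ids(file_b)

common = ids_a & ids_b   # ids in both files
only_a = ids_a - ids_b   # ids missing from the second file
only_b = ids_b - ids_a   # ids missing from the first file

print("Same Id :", sorted(common))
print("Only in first file :", sorted(only_a))
print("Only in second file :", sorted(only_b))
print("Id columns match :", not only_a and not only_b)
```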

3 Comments

Welcome. I am also looking at the pandas implementation above.
Thanks for the help, but I got this: Same Id : []. Maybe I am doing something wrong. How can I fix it?
Send me your .py file and the input files by email: [email protected]
