
I am working with CSV files that are each hundreds of megabytes (800k+ rows), use pipe delimiters, and have 90 columns. I need to compare two files at a time and generate a CSV file of any differences (i.e. rows in File2 that do not exist in File1, as well as rows in File1 that do not exist in File2), but performing the comparison using only a single column.

For instance, a highly simplified version would be:

File1

claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321

File2

claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000

In this example, the output file should look like this:

claim_date|status|first_name|last_name|phone|claim_number
20200505|active|Jane|Doe|5555551212|ABC321
20200510|active|Another|Person|5555551212|ABC000

As this example shows, both input files contain the row with claim_number ABC123, and although the first_name and last_name fields changed between the files, I do not care, because the claim_number was the same in both. The other rows contain unique claim_number values, so both are included in the output file.
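In set terms, what I'm after is the symmetric difference of the two files' claim_number values, carrying along the full rows. A quick self-contained illustration with the sample rows above (plain Python, no pandas, files inlined as strings):

```python
import csv
import io

file1 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321
"""

file2 = """claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000
"""

def rows(text):
    # Parse pipe-delimited text into a list of dicts keyed by header.
    return list(csv.DictReader(io.StringIO(text), delimiter='|'))

rows1, rows2 = rows(file1), rows(file2)
keys1 = {r['claim_number'] for r in rows1}
keys2 = {r['claim_number'] for r in rows2}

# Keep rows whose claim_number appears in exactly one of the two files.
diff = [r for r in rows1 if r['claim_number'] not in keys2] + \
       [r for r in rows2 if r['claim_number'] not in keys1]

print([r['claim_number'] for r in diff])  # ['ABC321', 'ABC000']
```

This won't scale nicely to 800k-row files, but it pins down the expected behavior: ABC123 is dropped (present in both), ABC321 and ABC000 are kept.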

I have been told that Pandas is the way to do this, so I have set up a Jupyter environment and am able to load the files correctly, but I am banging my head against the wall at this point. Any suggestions are highly appreciated!

My code so far:

import os
import pandas as pd

df1 = pd.read_table("/Users/X/Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("/Users/X/Claims_20210618.txt", sep='|', low_memory=False)

Everything else I've written is basically irrelevant at this point as it's just copypasta from the web that doesn't execute for one reason or another.

EDIT: Solution!

import pandas as pd

# dtype forces claim_number to be read as a string, so values like
# 20099E10011 are never parsed as floats (scientific notation)
df1 = pd.read_table("Claims_20210607.txt", sep='|',
                    dtype={'claim_number': str}, low_memory=False)
df2 = pd.read_table("Claims_20210618.txt", sep='|',
                    dtype={'claim_number': str}, low_memory=False)

df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False,
        ignore_index=True)
    .to_csv('diff.csv', index=False)  # index=False drops the added first column
)

Note that my earlier attempt called df1.astype({'claim_number': 'str'}) after reading, which did nothing: astype returns a new DataFrame rather than modifying in place, and by that point the values had already been parsed as floats anyway. Passing dtype at read time fixes that, and index=False keeps the added first column out of the output file. This is fantastic! Thanks!

  • I attempted to follow the guide here: hackersandslackers.com/compare-rows-pandas-dataframes but when I run it the error generated is "ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat" Commented Jun 24, 2021 at 19:40
  • @Nk03 That seems to have generated a file that is basically all of the rows from both files Commented Jun 24, 2021 at 19:43
  • I don't think I understand your example. Wouldn't the output have three rows? There are three unique claim numbers. Commented Jun 24, 2021 at 19:51
  • @MarkBaker posted 1 code snippet as an answer which you can try. Commented Jun 24, 2021 at 19:52
  • @NickODell The first line in each file has claim_number 'ABC123', so that entire line is omitted from the output file. ABC000 only exists in File2 and ABC321 only exists in File1, so both are included in the output. I want the output to include all columns, but only rows where the claim_number exists in only one of the two files. Commented Jun 24, 2021 at 19:57

1 Answer


IIUC, you can try:

  1. If you want to drop duplicates based on all columns except ['first_name', 'last_name']:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=df.columns.difference(['first_name', 'last_name']),
        keep=False)
    .to_csv('file3.csv')
)
  2. If you want to drop duplicates based on the claim_number column only:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False)
    .to_csv('file3.csv')
)
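One caveat worth flagging (an assumption about the data, not something the question confirms): drop_duplicates(keep=False) also removes rows whose claim_number is duplicated within a single file. If that can happen and such rows should be kept, an isin-based filter compares the two frames without that side effect. A sketch with toy stand-ins for df1/df2:

```python
import pandas as pd

# Toy stand-ins for the question's frames (claim_number plus one other
# column; the real files have 90 columns).
df1 = pd.DataFrame({'claim_number': ['ABC123', 'ABC321'],
                    'first_name': ['John', 'Jane']})
df2 = pd.DataFrame({'claim_number': ['ABC123', 'ABC000'],
                    'first_name': ['Someone', 'Another']})

# Rows whose key appears in only one of the two frames.
only_in_1 = df1[~df1['claim_number'].isin(df2['claim_number'])]
only_in_2 = df2[~df2['claim_number'].isin(df1['claim_number'])]
diff = pd.concat([only_in_1, only_in_2], ignore_index=True)

print(diff['claim_number'].tolist())  # ['ABC321', 'ABC000']
```

The two approaches give the same result when each claim_number is unique within its own file; they differ only in the within-file-duplicate case.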

2 Comments

Okay, this seems to be getting me close! My files have 90 columns, so I replaced 'first_name', 'last_name' with the 89 columns that I do not care about, and it generated a file with about a thousand rows which do indeed appear to be unique! The remaining issue is that it is not always treating the claim_number column as a string. For instance, a claim_number may look like 20099E10011, and from time to time the output file contains scientific notation, as though the value was interpreted as a number; e.g. one line in the output file contains a claim_number of 2.1158E+264
Your example #2 works very well! The same issue still exists re: some values being interpreted as scientific notation rather than compared as strings. Any thoughts on resolving that, as well as omitting the first column in the output (as it was not a column in the input files)?
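Both follow-ups in this comment can likely be handled at read/write time rather than after the fact: pass dtype={'claim_number': str} to read_table so the key is never parsed as a float, and pass index=False to to_csv to suppress the added first column. A sketch with an inlined sample (the value 1234E5 is a made-up claim number chosen to trigger the float parsing):

```python
import io
import pandas as pd

raw = "claim_number|status\n1234E5|active\n"

# Without dtype, a value like 1234E5 is parsed as a float,
# which is what produces the scientific notation in the output.
bad = pd.read_table(io.StringIO(raw), sep='|')
# With dtype, the value survives as the literal string.
good = pd.read_table(io.StringIO(raw), sep='|', dtype={'claim_number': str})

print(bad['claim_number'].iloc[0])   # 123400000.0
print(good['claim_number'].iloc[0])  # 1234E5

# index=False keeps the DataFrame index out of the written file,
# so no extra unnamed first column appears.
out = good.to_csv(sep='|', index=False)
print(out)  # claim_number|status ... with no leading index column
```

The same dtype mapping works on the real 90-column files; only the columns named in dtype are affected, the rest are inferred as usual.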
