1

In this project I've converted a csv file to an xls file and a txt file to an xls file. The objective is to then compare both xls files for differences and print out any differences to a third excel file.

However, when the differences are printed they include any entry with an integer above 999, as any integer from my converted csv file is treated as a string instead of an integer. Therefore it treats a value such as 1,200 (in my converted xls file) differently from 1200 (in my converted txt file) due to the comma in the converted csv excel file.

My question is: Is there a way to convert the string interpreted integers, back to being interpreted as integers? Otherwise, is there a way to delete all commas from my xls files? I have tried the usual dataframe.replace methodology and it is ineffective.

Below is my code:

#import required libraries
import datetime
import xlrd
import pandas as pd

#define the time_handle function to name the outputted excel files
time_handle = datetime.datetime.now().strftime("%Y%m%d_%H%M")

#identify XM1 file paths (for both csv origin and excel destination)
XM1_csv = r"filepath"
XM2_excel = r"filepath" + time_handle + ".xlsx"

#identify XM2 file paths (for both txt origin and excel destination)
XM2_txt = r"filepath"
XM2_excel = r"filepath" + time_handle + ".xlsx"

#remove commas from XM1 excel - failed attempts
#XM1_excel = [col.replace(',', '') for col in XM1_excel]
#XM1_excel = XM1_excel.replace(",", "")
#for line in XM1_excel:
        #XM1_excel.write(line.replace(",", ""))

#remove commas from XM1 CSV - failed attempts
#XM1_csv = [col.replace(',', '') for col in XM1_csv]
#XM1_csv = XM1_csv.replace(",", "")
#for line in XM1_csv:
        #XM1_excel.write(line.replace(",", ""))

#convert the csv XM1 file to an excel file, in the same folder
pd.read_csv(XM1_csv).to_excel(XM1_excel)

#convert the txt XM2 file to an excel file in the same folder
pd.read_csv(XM2_txt, sep="|").to_excel(XM2_excel)



#confirm XM1 filepath
filepath_XM1 = XM1_excel

#confirm XM2 filepath
filepath_XM2 = XM2_excel
#read relevant columns from the excel files
df1 = pd.read_excel(filepath_XM2, sheetname="Sheet1", parse_cols= "H, J, M, U")
df2 = pd.read_excel(filepath_XM1, sheetname="Sheet1", parse_cols= "C, E, G, K")

#remove all commas from XM1 - failed attempts
#df2 = [col.replace(',', '') for col in df2]
#df2 = df2.replace(",", "")
#for line in df2:
        #df2.write(line.replace(",", ""))

#merge the columns from both excel files into one column each respectively
df4 = df1["Exchange Code"] + df1["Product Type"] + df1["Product Description"] + df1["Quantity"].apply(str)
df5 = df2["Exchange"] + df2["Product Type"] + df2["Product Description"] + df2["Quantity"].apply(str)

#concatenate both columns from each excel file, to make one big column containing all the data
df = pd.concat([df4, df5])

#remove all whitespace from each row of the column of data
df=df.str.strip()
df=["".join(x.split()) for x in df]

#convert the data to a dataframe from a series
df = pd.DataFrame({'Value': df})

#remove any duplicates
df.drop_duplicates(subset=None, keep=False, inplace=True)

#print to the console just as a visual aid
print(df)
#output_path = r"filepath"
#print the erroneous entries to an excel file
df.to_excel("XM1_XM2Comparison" + time_handle + ".xls")

Also, I realize the XM1 and XM2 file names with regards to df1 and df2 is a bit confusing, but I simply renamed my files. It makes sense in terms of the files and where they belong in the code!

Thank You

1
  • You should Post this as an Answer and Accept your own post. That way anyone stumbling upon this issue will be able to see your solution easier. Commented Oct 5, 2017 at 17:34

2 Answers 2

1

You can try an argument called converters on the read-end of the dataframe where you can specify the datatype. Example:

df= pd.read_excel(file, sheetname=YOUR_SHEET_HERE, converters={'FIELD_NAME': str})

converters is both in read_csv and read_excel

Sign up to request clarification or add additional context in comments.

Comments

0

I actually solved this issue with a simple fix for future reference. when reading the csv using pd.read_csv, I added the thousands method so it looks like this:

pd.read_csv(XM1, thousands = ",").to_excel(XM1_excel)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.