90

I am using pandas in Python to compare two files like this:

import pandas as pd

# Pass only one of sep/delimiter; recent pandas versions raise an error when both are given.
df1 = pd.read_csv('input1.csv', sep=',', encoding='utf-8')
df2 = pd.read_csv('input2.csv', sep=',', encoding='utf-8')
df3 = pd.merge(df1, df2, on='employee_id', how='right')
df3.to_csv('output.csv', encoding='utf-8', index=False)

Currently I run the file through a script beforehand that strips spaces from the employee_id column.

An example of employee_ids:

37 78973 3
23787
2 22 3
123

Is there a way to get pandas to do it and save me a step?

3 Answers

151

You can strip() an entire Series in Pandas using .str.strip():

df1['employee_id'] = df1['employee_id'].str.strip()
df2['employee_id'] = df2['employee_id'].str.strip()

This removes leading and trailing whitespace from the employee_id column in both df1 and df2. Spaces inside a value are left untouched.
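As a quick check of what .str.strip() does and does not touch, here is a minimal sketch. The IDs are the ones from the question; the surrounding padding is made up for illustration.

```python
import pandas as pd

# IDs from the question, with hypothetical leading/trailing padding added.
ids = pd.Series(["  37 78973 3 ", "23787  ", " 2 22 3", "123"])

# .str.strip() trims only the ends of each string.
stripped = ids.str.strip()
print(stripped.tolist())
# ['37 78973 3', '23787', '2 22 3', '123']
```

The inner spaces survive, so strip() alone is not enough when the keys themselves contain spaces.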

Alternatively, modify the read_csv lines to use skipinitialspace=True, which skips spaces that follow a delimiter (pass only one of sep/delimiter, since recent pandas versions reject both together):

df1 = pd.read_csv('input1.csv', sep=',', encoding="utf-8", skipinitialspace=True)
df2 = pd.read_csv('input2.csv', sep=',', encoding="utf-8", skipinitialspace=True)
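To see the effect of skipinitialspace without touching real files, the sketch below reads from an in-memory string. The CSV content and the name column are invented for the example, and dtype=str is an assumption to keep the IDs as text.

```python
import io
import pandas as pd

# Hypothetical CSV text: spaces follow the commas, and one ID has inner spaces.
csv_text = "employee_id,name\n37 78973 3,  Alice\n23787, Bob\n"

df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, dtype=str)

print(df["name"].tolist())         # spaces after the delimiter are skipped
print(df["employee_id"].tolist())  # spaces inside a field are kept
```

So this option helps with padding after commas, but not with spaces inside the employee_id values.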

It looks like you are attempting to remove spaces in a string containing numbers, which can be accomplished with pandas.Series.str.replace:

df1['employee_id'] = df1['employee_id'].str.replace(" ", "")
df2['employee_id'] = df2['employee_id'].str.replace(" ", "")
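Putting the replace step together with the merge from the question might look like the sketch below. The two frames, the name and dept columns, and their values are made up. regex=False is added so the space is treated as a literal string regardless of pandas version.

```python
import pandas as pd

# Hypothetical stand-ins for input1.csv and input2.csv.
df1 = pd.DataFrame({"employee_id": ["37 78973 3", "2 22 3"],
                    "name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"employee_id": ["37789733", "2223", "123"],
                    "dept": ["Ops", "IT", "HR"]})

# Remove every space from the key column in both frames.
for frame in (df1, df2):
    frame["employee_id"] = frame["employee_id"].str.replace(" ", "", regex=False)

# Same right join as in the question; unmatched rows get NaN for name.
merged = pd.merge(df1, df2, on="employee_id", how="right")
print(merged)
```

Once the keys are normalized, "37 78973 3" and "37789733" line up, while 123 has no match in df1 and keeps a missing name.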

37

You can do the strip() in pandas.read_csv() as:

pandas.read_csv(..., converters={'employee_id': str.strip})

And if you need to only strip leading whitespace:

pandas.read_csv(..., converters={'employee_id': str.lstrip})

And to remove all spaces:

def strip_spaces(a_str_with_spaces):
    return a_str_with_spaces.replace(' ', '')

pandas.read_csv(..., converters={'employee_id': strip_spaces})
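For a self-contained run of the converter approach, the sketch below feeds read_csv an in-memory string instead of a file. The CSV text and the name column are invented for illustration.

```python
import io
import pandas as pd

def strip_spaces(a_str_with_spaces):
    # Converters receive the raw field text, so plain str methods apply.
    return a_str_with_spaces.replace(' ', '')

# Hypothetical file content; IDs are the ones from the question.
csv_text = "employee_id,name\n37 78973 3,Alice\n 23787 ,Bob\n2 22 3,Carol\n"

df = pd.read_csv(io.StringIO(csv_text), converters={'employee_id': strip_spaces})
print(df['employee_id'].tolist())
# ['37789733', '23787', '2223']
```

Because the converter returns strings, the column stays text and the cleanup happens while the file is read, with no separate pass.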

2 Comments

Would it be pythonic if I used my own converter that either returns the result of str.strip or None? I'm importing data from Excel and it would be great if I could turn empty cells into None without additional steps, but I'm not sure whether this kind of magic is legal.
DataFrame cells can be None if desired.
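A converter along the lines described in the comment above could be sketched as follows. strip_or_none is a hypothetical helper, and the example uses read_csv on an in-memory string rather than Excel. Whether the empty cells end up as None or NaN may depend on the pandas version, so the check uses pd.isna.

```python
import io
import pandas as pd

def strip_or_none(value):
    # Hypothetical converter: trim the field, and map empty results to None.
    stripped = value.strip()
    return stripped if stripped else None

# Made-up content: one padded ID, one whitespace-only ID, one empty ID.
csv_text = "employee_id,name\n 123 ,Alice\n   ,Bob\n,Carol\n"

df = pd.read_csv(io.StringIO(csv_text), converters={"employee_id": strip_or_none})
print(df["employee_id"].iloc[0])
print(df["employee_id"].isna().tolist())
```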

12

Df['employee'] = Df['employee'].str.strip()
