pandas and trouble with duplicate dates

Question

I have a CSV file with the following sample contents:

3/12/1970
3/1/1942
10/20/1945  10/20/1945
10/27/1960
10/5/1952

I bring it into pandas with df = pd.read_csv(filename).

I know there are rows with double dates, as noted above. The dtype of this column is object in pandas. When I try to convert this column to datetime in pandas, I get errors on all the rows with this double date issue and have to find and edit them in the CSV, one by one. So I have tried the following to clean up the affected rows among my 50K rows:

df[col] = df[col].str.strip()
df[col] = df[col].str[:10]

This does not affect any of the double dates at all.

I also tried to calculate the length of each value in the col and then simply remove date values if the resulting col length exceeds 10. Still, the double date rows remain.

I have also tried the following to even locate this particular row to inspect it further, but this code results in 0 rows.

bad_dates = df[df[col].str.contains('10/20/1945')]

So, any creative ideas to clean these double dates? (It happens with probably one hundred randomly distributed column values)

Both attempts you made worked normally for me. Are you sure you are using col as the correct column name? Anyway, using .str[:10] may not be the best solution for you, since dates can have a one- or two-digit month and day. Maybe you can try split(' ') or a regex (here is an example: stackoverflow.com/questions/46064162/…).
– Flavio Moraes, Commented Nov 21, 2020 at 5:43
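
The regex idea from the comment above could look roughly like the sketch below. It is only an illustration on the sample values from the question: the column name "col" and the new column "first_date" are placeholders, and the pattern assumes every date is written as M/D/YYYY with a four-digit year.

```python
import pandas as pd

# Sample values from the question; "col" is a placeholder column name.
df = pd.DataFrame({"col": ["3/12/1970", "3/1/1942", "10/20/1945  10/20/1945",
                           "10/27/1960", "10/5/1952"]})

# Capture the first M/D/YYYY token in each value. Month and day may have
# one or two digits, so no fixed-width slice is needed.
# expand=False returns a Series because the pattern has a single group.
df["first_date"] = df["col"].str.extract(r"(\d{1,2}/\d{1,2}/\d{4})", expand=False)

print(df)
```

Values that contain no date-like token at all come back as NaN, which makes them easy to find afterwards with isna().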

wheezay · Accepted Answer · 2020-11-21 11:02:59Z

1

You can use split to do this.

split() splits each value (a str) into a list of pieces separated by whitespace, and [-1] then selects the last piece. This drops the extra values and keeps a single date per row, which is what you need. Assign the result back to the column so the change is kept:

df['col'] = df['col'].apply(lambda x: x.split()[-1])
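
As a rough end-to-end sketch of the same split idea, here it is followed by the datetime conversion the question is ultimately after. This version uses the .str accessor instead of apply, which is a variation and not the answer's exact code; it passes missing values through as NaN instead of raising. The column names "col" and "date" and the format string "%m/%d/%Y" are assumptions based on the sample data.

```python
import pandas as pd

# Sample values from the question; "col" is a placeholder column name.
df = pd.DataFrame({"col": ["3/12/1970", "3/1/1942", "10/20/1945  10/20/1945",
                           "10/27/1960", "10/5/1952"]})

# Split each value on whitespace and keep the last piece.
# Unlike apply(lambda x: x.split()[-1]), this leaves NaN values as NaN.
cleaned = df["col"].str.split().str[-1]

# errors="coerce" turns anything that still fails to parse into NaT
# instead of raising, so the remaining bad rows can be listed afterwards.
df["date"] = pd.to_datetime(cleaned, format="%m/%d/%Y", errors="coerce")

bad_rows = df[df["date"].isna()]
print(df)
print(len(bad_rows), "rows could not be parsed")
```

Inspecting bad_rows is a way to avoid hunting through the CSV by hand for whatever rows remain unparseable.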

edited Nov 21, 2020 at 11:02

answered Nov 21, 2020 at 5:56

wheezay


2 Comments

Joe Ferndz Over a year ago

Can you please provide some write-up with your answer? Code-only answers are not recommended on Stack Overflow.

wheezay Over a year ago

Done. Not a great explanation, but it should do the job :)

Paul Brennan · Accepted Answer · 2020-11-21 05:46:09Z

0

With the test file

col
3/12/1970
3/1/1942
10/20/1945  10/20/1945
10/27/1960
10/5/1952

saved as ~/project/test/test.csv

import pandas as pd
df = pd.read_csv('~/project/test/test.csv')

gives

    col
0   3/12/1970
1   3/1/1942
2   10/20/1945 10/20/1945
3   10/27/1960
4   10/5/1952

Then your example

df['fixed'] = df['col'].str[:10]

Gives

    col                     fixed
0   3/12/1970               3/12/1970
1   3/1/1942                3/1/1942
2   10/20/1945 10/20/1945   10/20/1945
3   10/27/1960              10/27/1960
4   10/5/1952               10/5/1952

So this works on the sample data, which means something about your actual data differs from the sample and is causing the different result.
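
One way to look for that difference is to print the suspicious values with repr(), which shows characters that are invisible in normal output. The sketch below is only a diagnostic aid: the non-breaking spaces (\xa0) in the sample row are a made-up example of a hidden character, not something known about the real file, and col is assumed to hold the actual column name.

```python
import pandas as pd

# Hypothetical data: the middle row uses non-breaking spaces (\xa0)
# purely to illustrate what a hidden character would look like.
df = pd.DataFrame({"col": ["3/12/1970", "10/20/1945\xa0\xa010/20/1945", "10/5/1952"]})
col = "col"  # assumed column name; confirm it against df.columns

# Check that the column exists and that it really holds strings.
print(df.columns.tolist())
print(df[col].dtype)

# A single M/D/YYYY date is at most 10 characters long, so anything
# longer is a candidate for a double date.
lengths = df[col].str.len()
suspects = df.loc[lengths > 10, col]

# repr() exposes tabs, non-breaking spaces and similar characters.
for idx, value in suspects.items():
    print(idx, repr(value))
```

If suspects comes back empty on the real data, the double dates are probably not in the column that col refers to, which would also explain why str.contains found no rows.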

answered Nov 21, 2020 at 5:46

Paul Brennan


1 Comment

John Taylor Over a year ago

Thanks for the suggestions. I will dig into this and report back. I’m aware of split and may have tried it but didn’t report that. Let me check it out with my troublesome data.
