I am trying to clean some data from a CSV file. I need to make sure that every value in the 'Duration' column matches a certain format. This is how I went about it:
import re
import pandas as pd
data_path = './ufos.csv'
ufos = pd.read_csv(data_path)
valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
ufos_clean = ufos[valid_duration.match(ufos.Duration)]
ufos_clean.head()
This gives me the following error:
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5ebeaec39a83> in <module>()
6
7 valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
----> 8 ufos_clean = ufos[valid_duration.match(ufos.Duration)]
9
10 ufos_clean.head()
TypeError: expected string or buffer
I used a similar method to clean data before without the regular expressions. What am I doing wrong?
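For reference, here is a minimal sketch that reproduces the error without the CSV. The three-row `durations` Series is made up for illustration and stands in for `ufos.Duration`. It suggests that a compiled pattern's `match()` accepts a single string but not a whole pandas Series:

```python
import re
import pandas as pd

# Hypothetical stand-in for the 'Duration' column of ufos.csv
durations = pd.Series(["5 minutes", "about an hour", "30 seconds"])

valid_duration = re.compile(r'^[0-9]+ (seconds|minutes|hours|days)$')

# match() works when given one string at a time
single_ok = valid_duration.match(durations[0]) is not None

# Passing the whole Series raises TypeError, as in the traceback above
try:
    valid_duration.match(durations)
    raised = False
except TypeError:
    raised = True

print(single_ok, raised)
```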
Edit:
MaxU got me the closest, but what ended up working was:
valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos
ufos_clean = ufos_clean[ufos.Duration.str.contains(valid_duration_RE)]
There's probably a lot of redundancy in there, since I'm pretty new to Python, but it worked.
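A slightly tidier version of the same idea, as a sketch rather than the definitive way to do it. The small DataFrame below is invented sample data in place of the real `ufos.csv`. Two changes are my own assumptions about what is wanted: `na=False` treats missing durations as non-matching, and the group is written as non-capturing `(?:...)` so pandas does not warn about match groups:

```python
import pandas as pd

# Invented sample rows standing in for the real ufos.csv
ufos = pd.DataFrame(
    {"Duration": ["5 minutes", "about an hour", None, "30 seconds", "2 days"]}
)

# Same pattern as above, with a non-capturing group
valid_duration_RE = r'^[0-9]+ (?:seconds|minutes|hours|days)$'

# Build a boolean mask over the column; missing values count as invalid
mask = ufos.Duration.str.contains(valid_duration_RE, na=False)

# Keep only the rows whose Duration matches the format
ufos_clean = ufos[mask]
print(ufos_clean)
```

The intermediate `ufos_clean = ufos` assignment is not needed, because indexing with the mask already returns a new filtered DataFrame.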
ufos.Duration? Type: type(ufos.Duration)
ufos.Duration.apply(str) to cast it and see if that works
ufos.Duration to a string?