I am trying to clean some data from a CSV file. I need to make sure that every value in the 'Duration' column matches a certain format. This is how I went about it:

import re
import pandas as pd

data_path = './ufos.csv'
ufos = pd.read_csv(data_path)

valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
ufos_clean = ufos[valid_duration.match(ufos.Duration)]

ufos_clean.head()

This gives me the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-5ebeaec39a83> in <module>()
      6 
      7 valid_duration = re.compile('^[0-9]+ (seconds|minutes|hours|days)$')
----> 8 ufos_clean = ufos[valid_duration.match(ufos.Duration)]
      9 
     10 ufos_clean.head()

TypeError: expected string or buffer

I used a similar method to clean data before without the regular expressions. What am I doing wrong?

Edit:

MaxU got me the closest, but what ended up working was:

valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos
ufos_clean = ufos_clean[ufos.Duration.str.contains(valid_duration_RE)]

There's probably a lot of redundancy in there, since I'm pretty new to Python, but it worked.

  • And what is ufos.Duration? Type: type(ufos.Duration) Commented Sep 15, 2016 at 17:32
  • <class 'pandas.core.series.Series'> That would be the problem. I'm going to try to use ufos.Duration.apply(str) to cast it and see if that works Commented Sep 15, 2016 at 17:36
  • So that method of casting didn't work. How would I convert ufos.Duration to a string? Commented Sep 15, 2016 at 17:45

2 Answers

You can use the vectorized .str.contains() method:

valid_duration_RE = '^[0-9]+ (seconds|minutes|hours|days)$'
ufos_clean = ufos[ufos.Duration.str.contains(valid_duration_RE)]
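As for why the original attempt raised a TypeError: a compiled pattern's match() takes a single string, whereas ufos.Duration is a whole Series. The .str accessor applies the pattern to each element instead. Here is a minimal, self-contained sketch of the filter. The sample rows are made up for illustration (the real data comes from ufos.csv), and both na=False and the non-capturing group (?:...) are my own additions: the first treats missing durations as non-matching, the second avoids the warning pandas emits when a pattern passed to .str.contains() has capture groups.

```python
import pandas as pd

# Hypothetical sample standing in for ufos.csv
ufos = pd.DataFrame({
    "Duration": ["5 minutes", "about an hour", "30 seconds", None, "2 days", "10 mins"],
})

# Non-capturing group, so pandas does not warn about unused match groups
valid_duration_re = r"^[0-9]+ (?:seconds|minutes|hours|days)$"

# na=False makes missing values count as "no match" instead of propagating NaN
mask = ufos["Duration"].str.contains(valid_duration_re, na=False)

# Boolean indexing keeps only the rows where the mask is True
ufos_clean = ufos[mask]

print(ufos_clean["Duration"].tolist())
```

Without na=False, a missing value in Duration would leave a NaN in the mask, and indexing with such a mask may fail.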

1 Comment

@i., yes, thank you! I didn't notice .str.match() got deprecated

I guess you want it the other way round (not tested):

import re
import pandas as pd

data_path = './ufos.csv'
ufos = pd.read_csv(data_path)

def cleanit(val):
    # your regex solution here
    pass

ufos['ufos_clean'] = ufos['Duration'].apply(cleanit)

After all, ufos is a DataFrame.
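One possible way to fill in cleanit, sketched under the assumption that it should return True for a valid duration and False otherwise. apply() then produces a boolean mask, which can be used to select rows into a new DataFrame rather than to add a column. The sample data and the isinstance guard for missing values are assumptions for illustration, not part of the original code.

```python
import re
import pandas as pd

# Hypothetical sample standing in for ufos.csv
ufos = pd.DataFrame({"Duration": ["3 hours", "a few seconds", None, "45 seconds"]})

valid_duration = re.compile(r"^[0-9]+ (seconds|minutes|hours|days)$")

def cleanit(val):
    # Non-string values (e.g. NaN from empty cells) are treated as invalid
    return isinstance(val, str) and valid_duration.match(val) is not None

# apply() calls cleanit once per value and returns a boolean Series,
# which is used here as a row filter instead of being stored as a column
ufos_clean = ufos[ufos["Duration"].apply(cleanit)]

print(ufos_clean)
```

This calls a Python function once per row, so on large files it will likely be slower than the vectorized .str methods.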

1 Comment

I don't quite follow this solution. I'm trying to create a new cleaned DataFrame called ufos_clean that only contains the rows where Duration is in a valid format. I'm not trying to add something new to the existing DataFrame
