
I have a csv file with four columns. I read it like this:

df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])

Now, field C contains string values, but some rows hold non-string values (floats or ints) instead. How can I drop those rows? I'm using pandas version 0.18.1.


3 Answers


Setup

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df)

   A  B    C  D
0  a  b    c  d
1  e  f  1.2  g

Notice that you can inspect the type of each individual cell:

print(type(df.loc[0, 'C']), type(df.loc[1, 'C']))

<class 'str'> <class 'float'>

mask and slice

print(df.loc[df.C.apply(type) != float])

   A  B  C  D
0  a  b  c  d

more general

print(df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))])

   A  B  C  D
0  a  b  c  d

You could also attempt a float() conversion to determine whether a value can be interpreted as a float:

def try_float(x):
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

print(df.loc[~df.C.apply(try_float)])

   A  B  C  D
0  a  b  c  d

The problem with this approach is that you'll exclude strings that can be interpreted as floats.
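For example, in a hypothetical frame where one cell holds the string '1.5', try_float drops that row even though C is a str there, while the type-based mask keeps it:

```python
import pandas as pd

# Hypothetical frame: row 2 holds the *string* '1.5', not a float
df = pd.DataFrame([['a', 'b', 'c', 'd'],
                   ['e', 'f', 1.2, 'g'],
                   ['h', 'i', '1.5', 'j']], columns=list('ABCD'))

def try_float(x):
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

# '1.5' converts cleanly, so row 2 is excluded even though C is a str there
print(df.loc[~df.C.apply(try_float)])     # keeps only row 0

# The type-based mask keeps row 2, since '1.5' really is a str
print(df.loc[df.C.apply(type) != float])  # keeps rows 0 and 2
```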

Comparing times for the options I've provided, along with jezrael's solution, on small dataframes:

[timing plot: small dataframes]

For a dataframe with 500,000 rows:

[timing plot: 500,000 rows]

Checking whether the type is float seems to be the most performant, with to_numeric close behind. If you need to catch both int and float, I'd go with jezrael's answer. If checking for float alone is enough, use that one.
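If you want to reproduce the comparison yourself, a sketch along these lines works (the column contents here are made up, and the numbers will vary by machine and pandas version):

```python
import timeit

import pandas as pd

# Hypothetical mixed-type column of 500,000 cells: mostly strings, some floats
n = 500_000
vals = ['x'] * n
for i in range(0, n, 10):   # every 10th cell is a float
    vals[i] = 1.2
s = pd.Series(vals)

stmts = {
    'type check': "s.apply(type) != float",
    'isinstance': "s.apply(lambda x: not isinstance(x, (float, int)))",
    'to_numeric': "pd.to_numeric(s, errors='coerce').isnull()",
}
for name, stmt in stmts.items():
    t = timeit.timeit(stmt, globals=globals(), number=3)
    print('%-12s %.3fs' % (name, t))
```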


1 Comment

Is there a reason to match "not float or int" instead of "is str"? df.loc[df.C.apply(lambda x: isinstance(x, str))]
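For completeness, the isinstance(x, str) variant from the comment, run against the setup frame above (note that on Python 2 you would check basestring instead, to also catch unicode):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))

# Keep only the rows where C actually holds a str
print(df.loc[df.C.apply(lambda x: isinstance(x, str))])
#    A  B  C  D
# 0  a  b  c  d
```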

You can use boolean indexing with a mask created by to_numeric with errors='coerce': you get NaN wherever the value is a string. Then check isnull:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print(df)
   A  B  C  D
0  1  4  a  1
1  2  5  8  3
2  3  6  9  5

print(pd.to_numeric(df.C, errors='coerce'))
0    NaN
1    8.0
2    9.0
Name: C, dtype: float64

print(pd.to_numeric(df.C, errors='coerce').isnull())
0     True
1    False
2    False
Name: C, dtype: bool

print(df[pd.to_numeric(df.C, errors='coerce').isnull()])
   A  B  C  D
0  1  4  a  1

1 Comment

Is this method efficient for a data frame with 500,000 rows?

Use the pandas.DataFrame.select_dtypes method, e.g.:

df.select_dtypes(exclude='object')

or

df.select_dtypes(include=['int64', 'float', 'int'])
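One thing to keep in mind: select_dtypes filters whole columns by their dtype, not individual rows, so it answers a slightly different question than dropping mixed-type rows. A quick sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [1.5, 2.5]})

# Keeps whole columns whose dtype matches; it does not drop any rows
print(df.select_dtypes(exclude='object').columns.tolist())          # ['A', 'C']
print(df.select_dtypes(include=['int64', 'float']).columns.tolist())
```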

