
I have a csv file with four columns. I read it like this:

df = pd.read_csv('my.csv', error_bad_lines=False, sep='\t', header=None, names=['A', 'B', 'C', 'D'])

Now, field C contains string values, but some rows hold non-string values (floats or ints) instead. How can I drop those rows? I'm using pandas version 0.18.1.


3 Answers


Setup

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))
print(df)

   A  B    C  D
0  a  b    c  d
1  e  f  1.2  g

Notice that you can inspect the type of each individual cell:

print(type(df.loc[0, 'C']), type(df.loc[1, 'C']))

<class 'str'> <class 'float'>

mask and slice

print(df.loc[df.C.apply(type) != float])

   A  B  C  D
0  a  b  c  d

more general

print(df.loc[df.C.apply(lambda x: not isinstance(x, (float, int)))])

   A  B  C  D
0  a  b  c  d

You could also attempt a float() conversion to determine whether a value can be interpreted as a float:

def try_float(x):
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

print(df.loc[~df.C.apply(try_float)])

   A  B  C  D
0  a  b  c  d

The problem with this approach is that you'll exclude strings that can be interpreted as floats.
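For example, in a hypothetical frame where one cell holds the string '1.5', try_float drops that row even though C is a str there, while the type-based mask keeps it:

```python
import pandas as pd

# Hypothetical frame: row 2 holds the *string* '1.5', not a float
df = pd.DataFrame([['a', 'b', 'c', 'd'],
                   ['e', 'f', 1.2, 'g'],
                   ['h', 'i', '1.5', 'j']], columns=list('ABCD'))

def try_float(x):
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

# '1.5' converts cleanly, so row 2 is excluded even though C is a str there
print(df.loc[~df.C.apply(try_float)])     # keeps only row 0

# The type-based mask keeps row 2, since '1.5' really is a str
print(df.loc[df.C.apply(type) != float])  # keeps rows 0 and 2
```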

Comparing times for the options I've provided, along with jezrael's solution, on small dataframes:

[timing plot: small dataframes]

For a dataframe with 500,000 rows:

[timing plot: 500,000 rows]

Checking whether the type is float seems to be the most performant, with to_numeric close behind. If you need to catch both int and float, I'd go with jezrael's answer. If checking for float alone is enough, use that one.
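If you want to reproduce the comparison yourself, a sketch along these lines works (the column contents here are made up, and the numbers will vary by machine and pandas version):

```python
import timeit

import pandas as pd

# Hypothetical mixed-type column of 500,000 cells: mostly strings, some floats
n = 500_000
vals = ['x'] * n
for i in range(0, n, 10):   # every 10th cell is a float
    vals[i] = 1.2
s = pd.Series(vals)

stmts = {
    'type check': "s.apply(type) != float",
    'isinstance': "s.apply(lambda x: not isinstance(x, (float, int)))",
    'to_numeric': "pd.to_numeric(s, errors='coerce').isnull()",
}
for name, stmt in stmts.items():
    t = timeit.timeit(stmt, globals=globals(), number=3)
    print('%-12s %.3fs' % (name, t))
```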


1 Comment

Is there a reason to match "not float or int" instead of "is str"? df.loc[df.C.apply(lambda x: isinstance(x, str))]
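For completeness, the isinstance(x, str) variant from the comment, run against the setup frame above (note that on Python 2 you would check basestring instead, to also catch unicode):

```python
import pandas as pd

df = pd.DataFrame([['a', 'b', 'c', 'd'], ['e', 'f', 1.2, 'g']], columns=list('ABCD'))

# Keep only the rows where C actually holds a str
print(df.loc[df.C.apply(lambda x: isinstance(x, str))])
#    A  B  C  D
# 0  a  b  c  d
```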

You can use boolean indexing with a mask created by to_numeric with errors='coerce': you get NaN wherever the value is a string. Then check isnull:

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':['a',8,9],
                   'D':[1,3,5]})
print(df)
   A  B  C  D
0  1  4  a  1
1  2  5  8  3
2  3  6  9  5

print(pd.to_numeric(df.C, errors='coerce'))
0    NaN
1    8.0
2    9.0
Name: C, dtype: float64

print(pd.to_numeric(df.C, errors='coerce').isnull())
0     True
1    False
2    False
Name: C, dtype: bool

print(df[pd.to_numeric(df.C, errors='coerce').isnull()])
   A  B  C  D
0  1  4  a  1

1 Comment

Is this method efficient for a data frame with 500,000 rows?

Use the pandas.DataFrame.select_dtypes method, e.g.:

df.select_dtypes(exclude='object')

or

df.select_dtypes(include=['int64', 'float', 'int'])
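One thing to keep in mind: select_dtypes filters whole columns by their dtype, not individual rows, so it answers a slightly different question than dropping mixed-type rows. A quick sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y'], 'C': [1.5, 2.5]})

# Keeps whole columns whose dtype matches; it does not drop any rows
print(df.select_dtypes(exclude='object').columns.tolist())          # ['A', 'C']
print(df.select_dtypes(include=['int64', 'float']).columns.tolist())
```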

