
I have a dataframe

name    col1
satya    12
satya    abc
satya    109.12
alex     apple
alex     1000

Now I need to display the rows where column 'col1' has an integer value. The output should look like:

name    col1
satya    12
alex     1000

And if I search for string values:

name    col1
satya    abc
alex     apple

Likewise. Please suggest some code (maybe using regex).
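For reference, here is one way the dataframe above could be recreated, with every col1 value stored as a string (which is how pd.read_clipboard() or a CSV load would produce it):

```python
import pandas as pd

# All col1 values kept as strings, mimicking a clipboard/CSV load
df = pd.DataFrame({'name': ['satya', 'satya', 'satya', 'alex', 'alex'],
                   'col1': ['12', 'abc', '109.12', 'apple', '1000']},
                  columns=['name', 'col1'])
print(df)
```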

  • Usually column values share one type in pandas. Your data would more likely be stored as col1 and col2, with col1 holding the ints and col2 holding the strs, and NaN at the appropriate locations to fill the holes. Commented Apr 3, 2016 at 6:52

5 Answers


Let's start with a simple regex that will evaluate to True if you have an integer and False otherwise:

import re
regexp = re.compile('^-?[0-9]+$')
bool(regexp.match('1000'))
True
bool(regexp.match('abc'))
False

Once you have such a regex you can proceed as follows:

mask = df['col1'].map(lambda x: bool(regexp.match(x)))
df.loc[mask]

    name    col1
0   satya   12
4   alex    1000

To search for strings you'll do:

regexp_str = re.compile('^[a-zA-Z]+$')
mask_str = df['col1'].map(lambda x: bool(regexp_str.match(x)))
df.loc[mask_str]

    name    col1
1   satya   abc
3   alex    apple

EDIT

The above code would work if the dataframe were created by:

df = pd.read_clipboard()

(or, alternatively, all variables were supplied as strings).

Whether the regex approach works depends on how the df was created. E.g., if it were created with:

df = pd.DataFrame({'name': ['satya','satya','satya', 'alex', 'alex'],
                   'col1': [12,'abc',109.12,'apple',1000] },
                   columns=['name','col1'])

the above code would fail with TypeError: expected string or bytes-like object

To make it work in any case, one would need to explicitly coerce the type to str:

mask = df['col1'].astype('str').map(lambda x: bool(regexp.match(x)))
df.loc[mask]

    name    col1
0   satya   12
4   alex    1000

and the same for strings:

regexp_str = re.compile('^[a-zA-Z]+$')
mask_str = df['col1'].astype('str').map(lambda x: bool(regexp_str.match(x)))
df.loc[mask_str]

    name    col1
1   satya   abc
3   alex    apple

EDIT2

To find a float:

regexp_float = re.compile(r'^[-+]?[0-9]*(\.[0-9]+)$')
mask_float = df['col1'].astype('str').map(lambda x: bool(regexp_float.match(x)))
df.loc[mask_float]

    name    col1
2   satya   109.12
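As a side note (not part of the original answer), the same masks can be built without `map` by using the vectorized `Series.str.match` on the string-coerced column:

```python
import pandas as pd

# Rebuild the example frame with mixed types in col1
df = pd.DataFrame({'name': ['satya', 'satya', 'satya', 'alex', 'alex'],
                   'col1': [12, 'abc', 109.12, 'apple', 1000]},
                  columns=['name', 'col1'])

s = df['col1'].astype(str)                    # coerce everything to str once
ints = df[s.str.match(r'^-?\d+$')]            # integer-looking values
words = df[s.str.match(r'^[a-zA-Z]+$')]       # purely alphabetic values
floats = df[s.str.match(r'^[-+]?\d*\.\d+$')]  # float-looking values
```

The `$` anchors matter here: `str.match` only anchors at the start of the string, so without them '109.12' would match the integer pattern through its leading digits.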

2 Comments

Creating the mask throws an error, something like "TypeError: expected string or buffer". I am using pandas 0.16.1, Python 3.4. If it has anything to do with versions, please mention it. I have imported the re module successfully as well.
@Sergey what will be the regexp for creating a mask for the float type? Thanks for the explanations, helped me a lot.

In pandas you would do something like this:

mask = df.col1.apply(lambda x: type(x) == int)
print(df[mask])

Which would yield your expected output.
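Extending the same idea (a sketch, assuming the dataframe was built from a mixed Python list so col1 holds real Python objects), `isinstance` lets you pick out each type, strings included:

```python
import pandas as pd

# Mixed Python objects in col1: the column gets object dtype and
# each element keeps its original Python type
df = pd.DataFrame({'name': ['satya', 'satya', 'satya', 'alex', 'alex'],
                   'col1': [12, 'abc', 109.12, 'apple', 1000]},
                  columns=['name', 'col1'])

int_rows = df[df.col1.apply(lambda x: isinstance(x, int))]
str_rows = df[df.col1.apply(lambda x: isinstance(x, str))]
```

Note that the float 109.12 lands in neither mask; add `float` to the isinstance tuple if you want it grouped with the numbers.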

3 Comments

I might disappoint you, but this "would NOT yield your expected output."
@Sergey--can you please explain in which case Primer's code can fail to reproduce expected.(just curious to know).
@Satya An integer entered as a string would not be identified as an integer. I happened to generate your df with pd.read_clipboard(); in that case this did not work either. Whether the suggested solution produces the desired output depends on how the df was created.

You can check whether the value contains only digits:

In [104]: df
Out[104]:
    name    col1
0  satya      12
1  satya     abc
2  satya  109.12
3   alex   apple
4   alex    1000

Integers:

In [105]: df[~df.col1.str.contains(r'\D')]
Out[105]:
    name  col1
0  satya    12
4   alex  1000

Non-integers:

In [106]: df[df.col1.str.contains(r'\D')]
Out[106]:
    name    col1
1  satya     abc
2  satya  109.12
3   alex   apple

If you want to filter all numeric values (integers/floats/decimals), you can use pd.to_numeric(..., errors='coerce'):

In [75]: df
Out[75]:
    name    col1
0  satya      12
1  satya     abc
2  satya  109.12
3   alex   apple
4   alex    1000

In [76]: df[pd.to_numeric(df.col1, errors='coerce').notnull()]
Out[76]:
    name    col1
0  satya      12
2  satya  109.12
4   alex    1000

In [77]: df[pd.to_numeric(df.col1, errors='coerce').isnull()]
Out[77]:
    name   col1
1  satya    abc
3   alex  apple
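Building on the same `pd.to_numeric` trick (a sketch, not from the original answer, assuming col1 holds strings as in the question), integers can also be separated from floats with a modulo check:

```python
import pandas as pd

df = pd.DataFrame({'name': ['satya', 'satya', 'satya', 'alex', 'alex'],
                   'col1': ['12', 'abc', '109.12', 'apple', '1000']},
                  columns=['name', 'col1'])

# Non-numeric values become NaN; NaN fails both comparisons below
num = pd.to_numeric(df.col1, errors='coerce')
int_rows = df[num.notnull() & (num % 1 == 0)]    # whole numbers only
float_rows = df[num.notnull() & (num % 1 != 0)]  # values with a fractional part
```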

Comments

def is_integer(element):
    try:
        int(element)  # raises ValueError for non-numeric strings like 'abc' or '109.12'
        return 1
    except (ValueError, TypeError):
        return 0

You can simply define functions as below, then list your items with a for loop.

def list_str(list_of_data):
    str_list=[]
    for item in list_of_data:  # each item is a row with col1 at index 2; if rows are just (name, col1) pairs, replace item[2] with item[1]
        if not is_integer(item[2]):
            str_list.append(item)
    return str_list

def list_int(list_of_data):
    int_list=[]
    for item in list_of_data:
        if is_integer(item[2]):
            int_list.append(item)
    return int_list

Hope this can help you
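A hypothetical usage sketch of the same approach, assuming the rows are (name, col1) pairs so col1 sits at index 1. Note that float-like strings such as '109.12' fail int() and therefore land in the string list:

```python
def is_integer(element):
    try:
        int(element)  # raises ValueError for non-numeric strings
        return 1
    except (ValueError, TypeError):
        return 0

rows = [('satya', '12'), ('satya', 'abc'), ('satya', '109.12'),
        ('alex', 'apple'), ('alex', '1000')]

# col1 sits at index 1 of each (name, col1) pair
int_rows = [r for r in rows if is_integer(r[1])]
str_rows = [r for r in rows if not is_integer(r[1])]
```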

Comments


You can use df.applymap(np.isreal):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [12,'abc',109.12,'apple',1000], 'name': ['satya','satya','satya', 'alex', 'alex']})
df
col1    name
0   12  satya
1   abc     satya
2   109.12  satya
3   apple   alex
4   1000    alex

df2 = df[df.applymap(np.isreal)]
df2
col1    name
0   12  NaN
1   NaN     NaN
2   109.12  NaN
3   NaN     NaN
4   1000    NaN

df2 = df2[df2.col1.notnull()]
df2
col1    name
0   12  NaN
2   109.12  NaN
4   1000    NaN

index_list = df2.index.tolist()
index_list
[0, 2, 4]

df = df.iloc[index_list]
df
col1    name
0   12  satya
2   109.12  satya
4   1000    alex

Comments
