
I have made a dataframe using a function I created:

data = generate_xml()

I then make a subset of the dataframe based on the column names, which in this case are WalmartIDS and ASINS. Below is an example of what the dataframe looks like:

walmartIDS = data.loc[:,['WalmartIDS','ASINS']]

>>
    WalmartIDS  ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-Halloween-Decoration-THDA5581.html B076CNN6K5
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

As you can see, bad data sometimes gets into the WalmartIDS column. I want to filter it out by deleting every row in the newly created walmartIDS dataframe where the WalmartIDS column contains characters other than digits. I don't want to alter the data dataframe because it holds the raw data.

walmartIDS[walmartIDS.WalmartIDS != '^[-+]?[0-9]+$']

However, the above doesn't seem to do anything, and I can still see the bad data (row 606 in the example) that should have been deleted.

What is the proper way to do this?

5 Answers


Make a copy, convert the column to numeric, and drop the NA rows.

Test data:

from io import StringIO
import pandas as pd

data = StringIO("""

Walmart  IDS         ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-Halloween-Decoration-THDA5581.html   B076CNN6K5
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

""")

Create df and make a copy:

df = pd.read_table(data, delim_whitespace=True)

df2 = df.copy()  # a real copy, so df keeps the raw data untouched

Convert IDS to numeric and drop rows with na:

df2['IDS'] = pd.to_numeric(df2['IDS'], errors="coerce")

df2.dropna(how="any", inplace=True)

print(df2)

   Walmart          IDS       ASINS
0       602   20511489.0  B077BS6737
1       603   10311487.0  B077BMHVG7
2       604   10311302.0  B077BRTYCS
3       605  152381151.0  B077YW9PTQ
5       607   51409868.0  B0756DMVSC
6       608   51410962.0  B0756FKLCV
7       609   51411020.0  B0756F3F6J
8       610   51411529.0  B0756FDM74
9       611  915505165.0  B076W25SDZ
10      612  400796633.0  B076VM75ZF

1 Comment

I love this, although pandas preferred to represent the data in scientific notation or as a float with a decimal, as you can see, so I had to filter further by forcing it back into an integer. Works great though!
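
For reference, a minimal sketch of the extra step mentioned in the comment above, continuing this answer's df2 (after dropna there are no NaNs left, so a plain integer cast is safe):

# Cast the coerced column back to integers so it is no longer shown as floats.
df2['IDS'] = df2['IDS'].astype('int64')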

You can filter using str.isnumeric()

walmartIDS = data.loc[data.WalmartIDS.str.isnumeric()]
walmartIDS

    WalmartIDS  ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

6 Comments

FWIW, she did mention that she did not want to alter the raw data.
@W.Dodge, this will just delete the non-numeric row from the dataframe. It doesn't affect the raw data in any way
I know that part, but my interpretation of the OP is that she wanted to have a df of the original data and a df of the processed data.
The question reads: "I don't want to alter the data version of the data frame because it is the raw data."
@W.Dodge Not a big deal, just change it to df2 = df.loc[df.WalmartIDS.str.isnumeric()]. There are then 2 dataframes in memory.
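For reference, a minimal sketch of the pattern discussed in these comments, written with the question's own names (data is assumed to still hold the raw dataframe):

# data stays untouched; walmartIDS is a filtered copy of the two columns.
# astype(str) is only a guard in case the column is not already strings.
mask = data['WalmartIDS'].astype(str).str.isnumeric()
walmartIDS = data.loc[mask, ['WalmartIDS', 'ASINS']].copy()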

str.isdigit

df[df['IDS'].str.isdigit()]

    Walmart        IDS       ASINS
0       602   20511489  B077BS6737
1       603   10311487  B077BMHVG7
2       604   10311302  B077BRTYCS
3       605  152381151  B077YW9PTQ
5       607   51409868  B0756DMVSC
6       608   51410962  B0756FKLCV
7       609   51411020  B0756F3F6J
8       610   51411529  B0756FDM74
9       611  915505165  B076W25SDZ
10      612  400796633  B076VM75ZF

pd.to_numeric + Series.notnull

df[pd.to_numeric(df['IDS'], errors='coerce').notnull()]

    Walmart        IDS       ASINS
0       602   20511489  B077BS6737
1       603   10311487  B077BMHVG7
2       604   10311302  B077BRTYCS
3       605  152381151  B077YW9PTQ
5       607   51409868  B0756DMVSC
6       608   51410962  B0756FKLCV
7       609   51411020  B0756F3F6J
8       610   51411529  B0756FDM74
9       611  915505165  B076W25SDZ
10      612  400796633  B076VM75ZF

1 Comment

@Vaishali Good question. I think isnumeric checks for floats as well... not sure
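
For reference, a quick standalone sketch of the difference raised in this comment: neither string method accepts signs or decimal points, so floats such as '12.5' are rejected by both, while pd.to_numeric(..., errors='coerce') would parse them.

print('12345'.isdigit(), '12345'.isnumeric())  # True True
print('12.5'.isdigit(), '12.5'.isnumeric())    # False False
print('-3'.isdigit(), '-3'.isnumeric())        # False False
print('½'.isdigit(), '½'.isnumeric())          # False True (numeric but not a digit)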

So that you retain the raw data:

>>> df.join(df.loc[df['IDS'].str.isdigit(), 'IDS'], rsuffix='_clean')
    Walmart IDS ASINS   IDS_clean
0   602 20511489    B077BS6737  20511489
1   603 10311487    B077BMHVG7  10311487
2   604 10311302    B077BRTYCS  10311302
3   605 152381151   B077YW9PTQ  152381151
4   606 The-Holiday-Aisle-Projection-Kaleidoscope-Spid...   B076CNN6K5  NaN
5   607 51409868    B0756DMVSC  51409868
6   608 51410962    B0756FKLCV  51410962
7   609 51411020    B0756F3F6J  51411020
8   610 51411529    B0756FDM74  51411529
9   611 915505165   B076W25SDZ  915505165
10  612 400796633   B076VM75ZF  400796633

The column of valid numeric codes is named IDS_clean; any text codes (e.g. row 4) become NaN there.
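
If a fully filtered view is also wanted, a minimal sketch building on this answer (same df as above) keeps the raw frame intact and selects only the rows where a clean ID exists:

joined = df.join(df.loc[df['IDS'].str.isdigit(), 'IDS'], rsuffix='_clean')
# df itself is untouched; drop the rows whose IDS_clean is NaN.
clean = joined[joined['IDS_clean'].notna()]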



You need to use a regular expression (the re module):

import re
walmartIDS[re.match(r'^[-+]?[0-9]+$', walmartIDS.WalmartIDS) is not None]

1 Comment

When I try this I get the following error: TypeError: expected string or bytes-like object
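
For reference, a sketch of the same regex idea using pandas' vectorized string matching, which avoids that TypeError (assuming the WalmartIDS column holds strings):

# str.match applies the pattern to each element and returns a boolean mask;
# na=False counts missing values as non-matches.
mask = data['WalmartIDS'].str.match(r'^[-+]?[0-9]+$', na=False)
walmartIDS = data.loc[mask, ['WalmartIDS', 'ASINS']]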
