
I have made a dataframe using a function I created:

data = generate_xml()

I then make a subset of the dataframe based on the column names, which in this case are WalmartIDS and ASINS. Below is an example of what the dataframe looks like:

walmartIDS = data.loc[:,['WalmartIDS','ASINS']]

>>
    WalmartIDS  ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-Halloween-Decoration-THDA5581.html B076CNN6K5
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

As you can see, bad data sometimes gets into the WalmartIDS column. I want to filter it out by deleting every row in the newly created walmartIDS dataframe where the WalmartIDS column contains characters other than digits. I don't want to alter the data dataframe because it holds the raw data.

walmartIDS[walmartIDS.WalmartIDS != '^[-+]?[0-9]+$']

However, the above doesn't seem to do anything, and I can still see the bad data (row 606 in the example) that should have been deleted.

What is the proper way to do this?

5 Answers


Make a copy, convert the column to numeric, and drop the NA rows.

Test data:

from io import StringIO
import pandas as pd

data = StringIO("""

Walmart  IDS         ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
606 The-Holiday-Aisle-Projection-Kaleidoscope-Spider-Airblown-Inflatable-Halloween-Decoration-THDA5581.html   B076CNN6K5
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

""")

Create df and make a copy:

df = pd.read_table(data, delim_whitespace=True)

df2 = df.copy()  # a real copy, so df keeps the raw data untouched

Convert IDS to numeric and drop rows with na:

df2['IDS'] = pd.to_numeric(df2['IDS'], errors="coerce")

df2.dropna(how="any", inplace=True)

print(df2)

   Walmart          IDS       ASINS
0       602   20511489.0  B077BS6737
1       603   10311487.0  B077BMHVG7
2       604   10311302.0  B077BRTYCS
3       605  152381151.0  B077YW9PTQ
5       607   51409868.0  B0756DMVSC
6       608   51410962.0  B0756FKLCV
7       609   51411020.0  B0756F3F6J
8       610   51411529.0  B0756FDM74
9       611  915505165.0  B076W25SDZ
10      612  400796633.0  B076VM75ZF

1 Comment

I love this, although pandas preferred to represent the data in scientific notation or as a float with a decimal, as you can see, so I had to filter further by forcing it back into an integer. Works great though!
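
For reference, a minimal sketch of the extra step mentioned in the comment above, continuing this answer's df2 (after dropna there are no NaNs left, so a plain integer cast is safe):

# Cast the coerced column back to integers so it is no longer shown as floats.
df2['IDS'] = df2['IDS'].astype('int64')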

You can filter using str.isnumeric()

walmartIDS = data.loc[data.WalmartIDS.str.isnumeric()]
walmartIDS

    WalmartIDS  ASINS
602 20511489    B077BS6737
603 10311487    B077BMHVG7
604 10311302    B077BRTYCS
605 152381151   B077YW9PTQ
607 51409868    B0756DMVSC
608 51410962    B0756FKLCV
609 51411020    B0756F3F6J
610 51411529    B0756FDM74
611 915505165   B076W25SDZ
612 400796633   B076VM75ZF

6 Comments

FWIW, she did mention that she did not want to alter the raw data.
@W.Dodge, this will just delete the non-numeric row from the dataframe. It doesn't affect the raw data in any way
I know that part, but my interpretation of the OP is that she wanted to have a df of the original data and a df of the processed data.
The question reads: "I don't want to alter the data version of the data frame because it is the raw data."
@W.Dodge Not a big deal, just change it to df2 = df.loc[df.WalmartIDS.str.isnumeric()]. There are then 2 dataframes in memory.
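For reference, a minimal sketch of the pattern discussed in these comments, written with the question's own names (data is assumed to still hold the raw dataframe):

# data stays untouched; walmartIDS is a filtered copy of the two columns.
# astype(str) is only a guard in case the column is not already strings.
mask = data['WalmartIDS'].astype(str).str.isnumeric()
walmartIDS = data.loc[mask, ['WalmartIDS', 'ASINS']].copy()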

str.isdigit

df[df['IDS'].str.isdigit()]

    Walmart        IDS       ASINS
0       602   20511489  B077BS6737
1       603   10311487  B077BMHVG7
2       604   10311302  B077BRTYCS
3       605  152381151  B077YW9PTQ
5       607   51409868  B0756DMVSC
6       608   51410962  B0756FKLCV
7       609   51411020  B0756F3F6J
8       610   51411529  B0756FDM74
9       611  915505165  B076W25SDZ
10      612  400796633  B076VM75ZF

pd.to_numeric + Series.notnull

df[pd.to_numeric(df['IDS'], errors='coerce').notnull()]

    Walmart        IDS       ASINS
0       602   20511489  B077BS6737
1       603   10311487  B077BMHVG7
2       604   10311302  B077BRTYCS
3       605  152381151  B077YW9PTQ
5       607   51409868  B0756DMVSC
6       608   51410962  B0756FKLCV
7       609   51411020  B0756F3F6J
8       610   51411529  B0756FDM74
9       611  915505165  B076W25SDZ
10      612  400796633  B076VM75ZF

1 Comment

@Vaishali Good question. I think isnumeric checks for floats as well... not sure
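
For reference, a quick standalone sketch of the difference raised in this comment: neither string method accepts signs or decimal points, so floats such as '12.5' are rejected by both, while pd.to_numeric(..., errors='coerce') would parse them.

print('12345'.isdigit(), '12345'.isnumeric())  # True True
print('12.5'.isdigit(), '12.5'.isnumeric())    # False False
print('-3'.isdigit(), '-3'.isnumeric())        # False False
print('½'.isdigit(), '½'.isnumeric())          # False True (numeric but not a digit)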

So that you retain the raw data:

>>> df.join(df.loc[df['IDS'].str.isdigit(), 'IDS'], rsuffix='_clean')
    Walmart IDS ASINS   IDS_clean
0   602 20511489    B077BS6737  20511489
1   603 10311487    B077BMHVG7  10311487
2   604 10311302    B077BRTYCS  10311302
3   605 152381151   B077YW9PTQ  152381151
4   606 The-Holiday-Aisle-Projection-Kaleidoscope-Spid...   B076CNN6K5  NaN
5   607 51409868    B0756DMVSC  51409868
6   608 51410962    B0756FKLCV  51410962
7   609 51411020    B0756F3F6J  51411020
8   610 51411529    B0756FDM74  51411529
9   611 915505165   B076W25SDZ  915505165
10  612 400796633   B076VM75ZF  400796633

The column of valid numeric codes is named IDS_clean; any text codes (e.g. row 4) become NaN there.
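
If a fully filtered view is also wanted, a minimal sketch building on this answer (same df as above) keeps the raw frame intact and selects only the rows where a clean ID exists:

joined = df.join(df.loc[df['IDS'].str.isdigit(), 'IDS'], rsuffix='_clean')
# df itself is untouched; drop the rows whose IDS_clean is NaN.
clean = joined[joined['IDS_clean'].notna()]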



You need to use a regular expression (the re module):

import re
walmartIDS[re.match(r'^[-+]?[0-9]+$', walmartIDS.WalmartIDS) is not None]

1 Comment

When I try this I get the following error: TypeError: expected string or bytes-like object
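
For reference, a sketch of the same regex idea using pandas' vectorized string matching, which avoids that TypeError (assuming the WalmartIDS column holds strings):

# str.match applies the pattern to each element and returns a boolean mask;
# na=False counts missing values as non-matches.
mask = data['WalmartIDS'].str.match(r'^[-+]?[0-9]+$', na=False)
walmartIDS = data.loc[mask, ['WalmartIDS', 'ASINS']]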
