Five columns (col1 to col5) of a 10-column dataframe (df) should be either blank or contain text values only. If any row in these five columns holds an all-numeric value, I need to trigger an error. I wrote the following code to identify rows where the value in 'col1' is all-numeric (I will cycle through all five columns using the same code):

    df2 = df[df['col1'].str.isnumeric()]

I get the following error: ValueError: cannot mask with array containing NA / NaN values

This is triggered because the blank values produce NaN instead of False. I saw this when I assigned the result to a variable instead of using it as a mask:

    lst = df['col1'].str.isnumeric()
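
For reference, a minimal reproduction of the behaviour described above. The sample values are made up for illustration, and `dtype=object` is an assumption that mirrors a classic text column containing blanks:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one all-numeric string, one text value, one blank.
df = pd.DataFrame({"col1": pd.Series(["123", "abc", np.nan], dtype=object)})

mask = df["col1"].str.isnumeric()
print(mask.tolist())  # the blank stays missing instead of becoming False

# Using the mask for boolean indexing fails because of the missing entry.
try:
    df[mask]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```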

Any suggestions on how to solve this? Thanks

  • What error do you want to trigger? Or do you want to replace the numeric values with something else? Commented Feb 26, 2020 at 15:55
  • Have you tried df['col1'].astype(str).str.isnumeric() instead? Commented Feb 26, 2020 at 16:01
  • @YOLO This is a part of a bigger code, where I'm doing quality checks on data. In this case I write the error into a column 'Fail: {col1} is numeric'. I cannot use where and directly write this error into the column because the error column needs to record all errors - those found in other cols for this particular check and also for other checks conducted outside of the isnumeric() check. Commented Feb 26, 2020 at 16:01
  • pandas.pydata.org/pandas-docs/stable/reference/api/…. Blank strings create False. If the strings are themselves NaN, consider filling in ''. Commented Feb 26, 2020 at 16:04

2 Answers

Try this to work around the NaN values:

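    import pandas as pd

    # Sample frame: a number, a text value and a missing value
    df = pd.DataFrame([{'col1': 1}, {'col1': 'a'}, {'col1': None}])
    # astype(str) makes the column all-string, so .str.isnumeric()
    # yields a clean boolean mask without NaN
    lst = df['col1'].astype(str).str.isnumeric()
    if lst.any():
        raise ValueError('col1 contains an all-numeric value')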
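
If you would rather not convert the whole column to strings, one possible variant is to compare the result of `.str.isnumeric()` with True, which maps missing results to False. This is only a sketch with made-up sample data, not part of the original answer:

```python
import pandas as pd

# Hypothetical sample data; dtype=object mirrors a text column with blanks.
df = pd.DataFrame({"col1": pd.Series(["123", "abc", None, "2.02"], dtype=object)})

# .eq(True) is False wherever the result is missing, giving a plain boolean mask.
mask = df["col1"].str.isnumeric().eq(True)

# Rows whose col1 value is all-numeric.
df2 = df[mask]
print(df2)
```

Note that `str.isnumeric()` treats '2.02' as non-numeric because of the decimal point; the same goes for a leading minus sign.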

1 Comment

Your pre-edited code worked for me: I could replace the NaNs with text on the fly, so my dataframe was created. I haven't tried the revised code. This is what I finally used: df2 = df[df['col1'].astype(str).fillna('').str.isnumeric()]. I've marked your answer as the one that solved my question, but you may want to edit your response to also include your original version.

Here's a way to do it:

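    import string

    # Flag every cell that contains at least one digit character
    # (on pandas 2.1+, DataFrame.map is the non-deprecated name for applymap).
    # Then, per row, join the names of the flagged columns into the message.
    df['flag'] = (df
                 .applymap(lambda x: any(i for i in x if i in string.digits))
                 .apply(lambda x: f'Fail: {",".join(df.columns[x].tolist())} is numeric', 1))

    print(df)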

       col1  col2                   flag
    0     a  2.04  Fail: col2 is numeric
    1  2.02     b  Fail: col1 is numeric
    2     c     c      Fail:  is numeric
    3     d     e      Fail:  is numeric

Explanation:

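  • We iterate through each value of the dataframe and check whether it contains any digit character, which gives a boolean value per cell. Note that this flags values that merely contain a digit, not only all-numeric ones.
  • For each row, we use those boolean values to subset the column names and join them into the message. Rows with no flagged column end up with an empty name list, as in rows 2 and 3 of the output.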
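
To illustrate how the digit check in this answer differs from the all-numeric check asked about in the question, here is a small plain-Python comparison. The sample values are made up:

```python
import string

# Hypothetical sample values, including an empty (blank) string.
values = ["2.04", "a1", "123", "abc", ""]

# The test used in the answer: does the value contain at least one digit?
contains_digit = [any(ch in string.digits for ch in v) for v in values]

# The test from the question: is the whole value numeric?
all_numeric = [v.isnumeric() for v in values]

for v, c, a in zip(values, contains_digit, all_numeric):
    print(repr(v), c, a)
```

Only '123' passes both tests; '2.04' and 'a1' contain digits but are not all-numeric according to `str.isnumeric()`.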

Sample Data

    import pandas as pd

    df = pd.DataFrame({'col1': ['a', '2.02', 'c', 'd'],
                       'col2': ['2.04', 'b', 'c', 'e']})

2 Comments

Haven't tested yet, but this looks more efficient than what I finally used. I didn't use it because I cycle through 3 different checks for each column. Results from each check are written into the same flag column, depending on the existing value in it: 1. if the existing value == 'Pass', replace it with 'Fail + {error message}'; 2. otherwise append the additional failure. I don't see how to implement this immediately within my current code structure, but I think that if I write the results from each check into separate columns and then merge them, it may work. Will post here tomorrow if it does.
The above didn't work for me because I'm not checking ALL the columns in the dataframe; I get the list of columns to check from another dataframe. Is there a way to feed a list of column headers into your code?
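
One possible way to restrict the check to a list of columns and keep the Pass/append behaviour described in these comments is sketched below. The helper name `flag_numeric`, the `flag_col` parameter, the '; ' separator and the sample data are all hypothetical, and the sketch uses the all-numeric test (`str.isnumeric`) rather than the any-digit test from the answer:

```python
import pandas as pd


def flag_numeric(df, columns, flag_col="flag"):
    """Record 'Fail: <col> is numeric' for all-numeric text in the given columns.

    flag_col and the message wording are illustrative assumptions.
    """
    out = df.copy()
    if flag_col not in out.columns:
        out[flag_col] = "Pass"
    for col in columns:
        # astype(str) avoids NaN results from .str.isnumeric() on blanks.
        mask = out[col].astype(str).str.isnumeric()
        message = f"Fail: {col} is numeric"
        was_pass = out[flag_col].eq("Pass")
        replace_rows = mask & was_pass
        append_rows = mask & ~was_pass
        # Replace 'Pass' outright; otherwise append to failures already recorded.
        out.loc[replace_rows, flag_col] = message
        out.loc[append_rows, flag_col] = out.loc[append_rows, flag_col] + "; " + message
    return out


df = pd.DataFrame({
    "col1": ["a", "202", None, "d"],
    "col2": ["204", "305", "c", "e"],
    "col3": ["1", "2", "3", "4"],  # not in the check list, so it is ignored
})
cols_to_check = ["col1", "col2"]  # could equally be read from another dataframe
result = flag_numeric(df, cols_to_check)
print(result)
```

Because the flag column is updated in place per check, other checks could call a similar helper on the same frame and append their own messages.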
