0

I am currently working with database which is generated directly from machine instruments.(PI Data) I am trying to remove information from the database which is mainly related to communication error. For example

df_ex = pd.DataFrame({'Ambient': [20.11112, 'Bad',200.22222],
                  'Humidity': [50.47585, 89.14251,'Bad Input'],
                  'Compressor Temp': [50.47585, 89.14251,72.89657],
                  'Bearing Temp': ['Scan off', 74.45551, 52.2227]})

df_ex:

     Ambient   Humidity  Compressor Temp Bearing Temp
0   20.11112   50.47585         50.47585     Scan off
1        Bad   89.14251         89.14251     74.45551
2  200.22222  Bad Input         72.89657      52.2227

Errors received with df is filtered out by using below code.

df_ex.replace(['Bad','Bad Input','Scan Off'],np.nan, inplace=True)
df_ex.replace(" ", np.nan, inplace=True)
df_ex.dropna(inplace = True)

I am trying to figure out way on how can I remove errors "Bad", "Bad Input" etc from dataframe by filtering out unknown error. I am filtering errors manually most of the time from excel file.

Due to huge size of data base trying unique value also doesn't help in filtering as there are many unique decimal values.

I used this code to find unique values to find error in the dataframe

df_ex['Ambient'].unique()

Errors are changing time to time depending on network communication failure (Unknown error). So instead of "Bad" it may have "Bad Data" which I need to remove manually. If I fail to find out correct name of error my code doesn't run.

Can some one help on this?

Thank you for your time.

I tried out = df.apply(pd.to_numeric, errors = 'coerce') which removed errors but changed Date time in to nan. Before

After

4
  • So, to make sure, your error-messages are in df_ex["Ambient"] or..? I.e how do you know if a message is an unknown error? Commented Feb 14, 2022 at 8:58
  • Do bad inputs come always as a string? Commented Feb 14, 2022 at 9:00
  • @ CutePoison Error is in form of string (eg "Failure", "Communication Loss") rest all data is float. Commented Feb 14, 2022 at 9:11
  • @ John Giorgio Yes it is always a string. Commented Feb 14, 2022 at 9:11

1 Answer 1

1

IIUC, you have a dataframe with numerical data and sometimes string values which denote incorrect data.

You could convert to numeric with errors="coerce" to force invalid data to NaN:

out = df_ex.apply(pd.to_numeric, errors='coerce')

output:

     Ambient  Humidity  Compressor Temp  Bearing Temp
0   20.11112  50.47585         50.47585           NaN
1        NaN  89.14251         89.14251      74.45551
2  200.22222       NaN         72.89657      52.22270
Sign up to request clarification or add additional context in comments.

4 Comments

Yes this is working fine for me. The only problem is date time is changing to nan. I used df['DATE'] = pd.to_datetime(df['DATE']) to convert it but it shows Date as int. do you have any suggestion for this?
Can you update your example?
Output I have updated in the main post.
@DBV then only apply to a subset of your columns: cols = ['col1', 'col2'] ; df_ex[cols] = df_ex[cols].apply(pd.to_numeric, errors='coerce')

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.