Pandas: Large CSV file data manipulation

Question

I have a large dataset from a CSV file. It has two columns, the first is Date/Time in hh:mm:ss:ms form and the other is Pressure in number form. The pressure has randon values throughout that are not numerical values (things like 150+AA42BB43). These appear randomly throughout the 50,000 rows in the file and are not the same.

I need a way to change these Pressure values to numeric so I can perform data manipulation on them.

df_cleaned = df['Pressure'].loc[~df['Pressure'].map(lambda x: isinstance(x, float) | isinstance(x, int))]

I tried this, but it got rid of my Date/Time values and also did not clean all the pressure values while also getting rid of my headers.

I was wondering if anyone had any suggestions on how I can easily clean the data in the 2nd column while also keeping my Date/Time values in the first column accurate.

you should rather use df_cleaned = df.loc[....] (or even df_cleaned = df[....]) instead of df['Pressure'].loc[...] — furas
– furas, Commented Oct 6, 2021 at 0:58
Using df_cleaned = df['Pressure']... you get only one column (Pressure) and you skip other columns - and this is why you don't have Date/Time. And because it is single column so it may give it as Series instead of DataFrame - and this can remove your header(s) because Series (single column) doesn't need header. — furas
– furas, Commented Oct 6, 2021 at 1:05
How exactly do you want to clean the pressure values? Just get rid of the non-numeric characters and convert to a float? — Stuart
– Stuart, Commented Oct 6, 2021 at 1:17

furas · Accepted Answer · 2021-10-06 01:45:33Z

Your problem is that you use

df_cleaned = df['Pressure']

and this get only one column (Pressure) and skip other columns. And when you get single column then it may gives you Series instead of DataFrame - and Series can keep only one column so it doesn't need header to select columns.

You should run it without ['Pressure']

df_cleaned = df.loc[ ~df['Pressure'].map(...) ]

or even

df_cleaned = df[ ~df['Pressure'].map(...) ]

BTW: shorter isinstance(x, (float, int))

But using isinstance may not work if you have float/int values as strings - because isinstance("123", (float, int)) gives False - and you would have to rather try to convert float("123") and int("123") and catch error.

import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'], 
    'Pressure': [78, '150+AA42BB43', 23], 
}

df = pd.DataFrame(data)

df_cleaned = df[ df['Pressure'].map(lambda x:isinstance(x, (float, int))) ]

print(df_cleaned)

Result:

     DateTime Pressure
0  2021.10.04       78
2  2021.10.06       23

EDIT:

If you have values as string then you can use to_numeric to convert them and put NaN if value can't be converted

df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')

and then you can filter it usinf isna()

df_cleaned = df[ ~df['Pressure'].isna() ]

import pandas as pd

data = {
    'DateTime': ['2021.10.04', '2021.10.05', '2021.10.06'], 
    'Pressure': ['78.2', '150+AA42BB43', '23'], 
}

df = pd.DataFrame(data)

df['Pressure'] = pd.to_numeric(df['Pressure'], errors='coerce')
print(df)

df_cleaned = df[ ~df['Pressure'].isna() ]
print(df_cleaned)

Maria · Accepted Answer · 2021-10-05 15:32:03Z

1

I think I've an answer if all your non-numerical values are strings.

Have you tried using pandas replace()? Something like:

df['Pressure'].replace(to_replace = r'.+', value=0, inplace=True, regex=True)

I've used a regex to determine "any string". inplace=True allows that the existing dataframe is modified, instead of creating a new one.

Here, the function will replace any string with a given integer. I am not sure which integer you'd like to put there, so I've just used zero as an example. If you'd like a different integer for each string, you could use a map as explained in this answer.

answered Oct 5, 2021 at 15:32

Maria

4076 silver badges15 bronze badges

Collectives™ on Stack Overflow

Pandas: Large CSV file data manipulation

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related