
I have a corrupt CSV file with random copies of the header row scattered through the data. How can I ignore or delete these rows while loading the file into a data frame?

Because this stray header is mixed into the data, pandas raises an error while loading. I would like to ignore these rows while loading with pandas, or delete them somehow before loading.

The file looks like this:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3  <- this is a random copy of the header inside the data
0, 1, 1
0, 0, 0
1, 1, 1

I want:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1
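Since the question also asks about deleting the stray header rows before pandas ever sees them, here is a minimal sketch of that pre-filtering approach. It assumes the exact file contents shown above; the file is inlined via `io.StringIO` here instead of read from disk, and `skipinitialspace=True` handles the spaces after the commas:

```python
import io
import pandas as pd

# Inlined stand-in for the corrupt file on disk.
csv_text = """col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1"""

lines = csv_text.splitlines()
header = lines[0].strip()
# Keep the first header line, then drop every later line that repeats it.
cleaned = [lines[0]] + [ln for ln in lines[1:] if ln.strip() != header]

df = pd.read_csv(io.StringIO("\n".join(cleaned)), skipinitialspace=True)
print(df)
```

Because the repeats are removed before parsing, the columns come out numeric with no extra typecasting step.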
  • If all data are numeric, try pd.read_csv('file.csv').apply(pd.to_numeric, errors='coerce').dropna().reset_index(drop=True)? Commented Sep 5, 2019 at 15:39
  • Possible duplicate of Pandas dataframe read_csv on bad data. Commented Sep 5, 2019 at 15:55
  • A simple solution would be to drop those rows: df = df.drop(df[df.col1 == "col1"].index). Commented Mar 22, 2024 at 10:55
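The coerce-to-numeric idea from the first comment works because the stray header row turns into NaN and can then be dropped. A minimal sketch, assuming all data columns really are numeric (the file contents are inlined via `io.StringIO` here instead of `'file.csv'`):

```python
import io
import pandas as pd

# Inlined stand-in for the corrupt file on disk.
csv_text = """col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1"""

# Coerce every value to a number: the repeated-header row becomes NaN,
# dropna() removes it, and astype(int) restores integer dtype.
df = (pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
        .apply(pd.to_numeric, errors='coerce')
        .dropna()
        .astype(int)
        .reset_index(drop=True))
print(df)
```

Note the caveat in the comment: this silently discards any row containing a non-numeric value, not just the header repeats.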

2 Answers


Throw in na_filter = False so your columns come in as strings. Then locate the rows containing the stray header and filter them out of your dataframe.

>>> df = pd.read_csv('sample.csv', header = 0, na_filter = False)
>>> df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df.iloc[0,0])
<class 'str'>

Now that each column is parsed as strings, locate the literal col1, col2, and col3 values in your df and tag those rows in a new column using np.where():

>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), ['Remove'], ['Don\'t remove'])
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove

Now filter out the rows tagged as Remove in the Tag column using isin():

>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
  col1 col2 col3           Tag
0    0    1    1  Don't remove
1    0    0    0  Don't remove
2    1    1    1  Don't remove
4    0    1    1  Don't remove
5    0    0    0  Don't remove
6    1    1    1  Don't remove

Drop the Tag column:

>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
  col1 col2 col3
0    0    1    1
1    0    0    0
2    1    1    1
4    0    1    1
5    0    0    0
6    1    1    1

Finally, typecast your dataframe to int, if you need it to be integer:

>>> df2 = df2.astype(int)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df2['col1'][0])
<class 'numpy.int32'>

Note: if you want a standard 0-based index, use:

>>> df2.reset_index(inplace = True, drop = True)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1
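For reference, the tag-and-filter steps above can be collapsed into a single boolean mask. This is a minimal sketch under the same assumptions (sample contents inlined via `io.StringIO`, spaces after commas handled with `skipinitialspace=True`); it keys only on col1 rather than checking all three columns:

```python
import io
import pandas as pd

# Inlined stand-in for sample.csv.
csv_text = """col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1"""

df = pd.read_csv(io.StringIO(csv_text), header=0,
                 na_filter=False, skipinitialspace=True)

# A row is a stray header when its col1 value is not one of the data values.
mask = df['col1'].isin(['0', '1'])
df2 = df[mask].astype(int).reset_index(drop=True)
print(df2)
```

This avoids the intermediate Tag column entirely; the trade-off is that it assumes the header text can never collide with a legitimate data value in col1.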

You just need to do the following. Assuming df_raw is your original dataframe, with the column headers present both as the column names and repeated in several other rows, your corrected dataframe is df:

# Filter out only the rows without the headers in them.
headers = df_raw.columns.tolist()
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)

Assumption:
- Any row whose first column equals the first column header is a stray header row and has to be dropped.

In Detail
Now a detailed code block for anyone to
- create the data,
- write it into a csv file,
- load it as a dataframe, and then
- remove rows that are headers.

import numpy as np
import pandas as pd

# make a csv file to load as dataframe
data = '''col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1'''
# Write the data to a csv file
with open('data.csv', 'w') as f:
    f.write(data)
# Load your data with header=None
df_raw = pd.read_csv('data.csv', header=None)

# Declare which row holds the header data:
#    assuming the top row, we set this to zero.
header_row_number = 0

# Read in columns headers
headers = df_raw.iloc[header_row_number].tolist()

# Set new column headers
df_raw.columns = headers

# Filter out only the rows without the headers in them
# We assume that the appearance of the 
# first column header means that row has to be dropped
# And reset index (and drop the old index column)
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
df

Solution Output

