
I am reading in a large csv in pandas with:

features = pd.read_csv(filename, header=None, names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort','DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'], usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])

I get:

sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.

How can I find the first line in the input which is causing this warning? I need to do this to debug the problem with the input file, which shouldn't have mixed types.

  • please see similar question here - stackoverflow.com/questions/24251219/… Commented Dec 15, 2017 at 17:40
  • @mm441 Thank you, but that doesn't seem to contain an answer to how to find the line that causes the warning, does it? Commented Dec 15, 2017 at 19:43
  • How big is your file? If it's small enough, "by eye" may be the fastest way. Commented Dec 15, 2017 at 19:57
  • @MadPhysicist About 4 millions lines. Commented Dec 15, 2017 at 19:58
  • Have an intern do it then :) Commented Dec 15, 2017 at 19:58

2 Answers


Once pandas has finished reading the file, you cannot figure out which lines were problematic (see this answer for why).

This means you need to detect the problem while the file is being read: for example, read it line by line and check the types in each line; the first line whose types don't match the expected ones is the one you want.

To achieve this with pandas, pass chunksize=1 to pd.read_csv() to read the file in chunks (DataFrames of N rows each, here N=1). See the documentation if you want to know more about this.

The code goes something like this:

import pandas as pd

# Read the file in chunks of one row. This returns a TextFileReader
# (an iterator of DataFrames) rather than a single DataFrame.
reader = pd.read_csv(filename, chunksize=1)

# Use the first chunk to establish the "true" expected types.
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]]

i = 1  # current row index; start at 1 because the first row was already read
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i)  # this is the first mismatching row
        break
    i += 1

Note that this code assumes the first row has the "true" types. It is also quite slow, so I recommend checking only the columns you suspect are problematic (though even that does not gain much performance).
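A faster alternative to comparing Python types row by row is to read the suspect column as strings and vectorize the check with pd.to_numeric. This is a sketch of my own, not part of the answer above: csv_text and the two-column layout are hypothetical stand-ins for the real file, where the suspect column would be DstPort.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: row 1 has a non-numeric DstPort.
csv_text = "10,5\n20,abc\n30,7\n"

# Force everything to str so nothing is coerced silently on read.
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=["SrcBytes", "DstPort"], dtype=str)

# Values that fail numeric conversion become NaN; flag those rows.
converted = pd.to_numeric(df["DstPort"], errors="coerce")
bad_rows = df.index[converted.isna() & df["DstPort"].notna()]
print(bad_rows[0])  # index of the first non-numeric DstPort value
```

This scans the whole column in one vectorized pass instead of one DataFrame per row, which matters at 4 million lines.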



import pandas as pd

for endrow in range(1000, 4000000, 1000):
    startrow = endrow - 1000
    try:
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=1000,
                    header=None,
                    names=['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol',
                           'SrcPort', 'DstPort', 'SrcPackets', 'DstPackets',
                           'SrcBytes', 'DstBytes'],
                    usecols=['Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                             'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")

This splits the file into chunks of 1000 rows each, so you can see in which range of rows there is a mixed-type value causing the warning.
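Once a 1000-row window has been flagged, the exact offending line can be pinpointed with the standard csv module. A minimal sketch of my own, using a hypothetical csv_text stand-in; in the real file the suspect column index would be 6 (DstPort), here it is 1 in the toy data:

```python
import csv
import io

# Hypothetical stand-in: line 2 has a non-integer in the suspect column.
csv_text = "t,1\nt,2\nt,x\nt,4\n"
start, end, col = 0, 4, 1  # the flagged window and suspect column index

bad_line = None
for lineno, row in enumerate(csv.reader(io.StringIO(csv_text))):
    if start <= lineno < end:
        try:
            int(row[col])  # the dtype the column is supposed to have
        except ValueError:
            bad_line = lineno
            break
print(bad_line)
```

With the real file, replace io.StringIO(csv_text) with open(filename) and set start/end to the range reported by the chunked read above.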

