
I am reading in a large csv in pandas with:

features = pd.read_csv(filename, header=None, names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort','DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'], usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])

I get:

sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.

How can I find the first line in the input which is causing this warning? I need to do this to debug the problem with the input file, which shouldn't have mixed types.

  • please see similar question here - stackoverflow.com/questions/24251219/… Commented Dec 15, 2017 at 17:40
  • @mm441 Thank you, but that doesn't seem to contain an answer to how to find the line that causes the warning, does it? Commented Dec 15, 2017 at 19:43
  • How big is your file? If it's small enough, "by eye" may be the fastest way. Commented Dec 15, 2017 at 19:57
  • @MadPhysicist About 4 millions lines. Commented Dec 15, 2017 at 19:58
  • Have an intern do it then :) Commented Dec 15, 2017 at 19:58

2 Answers


Once pandas has finished reading the file, you cannot figure out which lines were problematic (see this answer for why).

This means you need to detect the problem while the file is being read: for example, read it line by line and check the types in each line; the first line whose types don't match the expected ones is the one you want.

To achieve this with pandas, pass chunksize=1 to pd.read_csv() to read the file in chunks (DataFrames of N rows each, here N=1). See the documentation if you want to know more about this.

The code goes something like this:

import pandas as pd

# Read the file in chunks of one row. This returns a TextFileReader
# (an iterator of DataFrames) rather than a single DataFrame.
reader = pd.read_csv(filename, chunksize=1)

# Use the first chunk to establish the "true" expected types.
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]]

i = 1  # current row index; start at 1 because the first row was already read
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i)  # this is the first mismatching row
        break
    i += 1

Note that this code assumes the first row has the "true" types. It is also quite slow, so I recommend checking only the columns you suspect are problematic (though even that does not gain much performance).
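A faster alternative to comparing Python types row by row is to read the suspect column as strings and vectorize the check with pd.to_numeric. This is a sketch of my own, not part of the answer above: csv_text and the two-column layout are hypothetical stand-ins for the real file, where the suspect column would be DstPort.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: row 1 has a non-numeric DstPort.
csv_text = "10,5\n20,abc\n30,7\n"

# Force everything to str so nothing is coerced silently on read.
df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=["SrcBytes", "DstPort"], dtype=str)

# Values that fail numeric conversion become NaN; flag those rows.
converted = pd.to_numeric(df["DstPort"], errors="coerce")
bad_rows = df.index[converted.isna() & df["DstPort"].notna()]
print(bad_rows[0])  # index of the first non-numeric DstPort value
```

This scans the whole column in one vectorized pass instead of one DataFrame per row, which matters at 4 million lines.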



import pandas as pd

for endrow in range(1000, 4000000, 1000):
    startrow = endrow - 1000
    try:
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=1000,
                    header=None,
                    names=['Time', 'Duration', 'SrcDevice', 'DstDevice', 'Protocol',
                           'SrcPort', 'DstPort', 'SrcPackets', 'DstPackets',
                           'SrcBytes', 'DstBytes'],
                    usecols=['Duration', 'SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                             'SrcPackets', 'DstPackets', 'SrcBytes', 'DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")

This splits the file into chunks of 1000 rows each, so you can see in which range of rows there is a mixed-type value causing the warning.
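Once a 1000-row window has been flagged, the exact offending line can be pinpointed with the standard csv module. A minimal sketch of my own, using a hypothetical csv_text stand-in; in the real file the suspect column index would be 6 (DstPort), here it is 1 in the toy data:

```python
import csv
import io

# Hypothetical stand-in: line 2 has a non-integer in the suspect column.
csv_text = "t,1\nt,2\nt,x\nt,4\n"
start, end, col = 0, 4, 1  # the flagged window and suspect column index

bad_line = None
for lineno, row in enumerate(csv.reader(io.StringIO(csv_text))):
    if start <= lineno < end:
        try:
            int(row[col])  # the dtype the column is supposed to have
        except ValueError:
            bad_line = lineno
            break
print(bad_line)
```

With the real file, replace io.StringIO(csv_text) with open(filename) and set start/end to the range reported by the chunked read above.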

