4

So, I'm experimenting with pandas on the IMDB files, especially title.basics.tsv. When trying to parse the runtimeMinutes column as "Int64", I get an error:

ValueError: Unable to parse string "Reality-TV" at position 47993

However, neither line 47994 nor the lines directly around it contain the string Reality-TV. So I started deleting entries from the beginning of the data file, and indeed, the reported position went down, until I had deleted exactly 47994 entries, at which point the error became

ValueError: Unable to parse string "Reality-TV" at position 65535

This made me suspect that the position variable is a uint16 that overflows. Is there a way to deal with this kind of problem and find the actual line that is causing trouble?


Here is the command I used:

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })

2 Answers

3

I looked at your data and analyzed the "runtimeMinutes" column. It contains str values that cannot be converted to integers, and these are what cause the error. The picture shows a list of these str values.

[Image: Incorrect error]

Code to find the offending values:

import pandas as pd

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     na_values={
                         "runtimeMinutes": ["\\N"],
                     })

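def search_error_values(df, column):
    """Return the unique values of df[column] that int() cannot convert."""
    error_values = []

    print(f"{'Type':20} | {'Value'}")
    print('-'*53)
    # dropna() skips the NaN entries produced by na_values; they are not errors.
    for val in df[column].dropna().unique():
        try:
            int(val)
        except (ValueError, TypeError):
            print(f"{str(type(val)):20} | {val}")
            error_values.append(val)

    print("\nIncorrect values:", error_values)
    return error_values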

values_error = search_error_values(titles, "runtimeMinutes")

I suggest the following solution. Loading the data takes longer this way, but you pay that cost only once if you then save the processed DataFrame and reuse it.

Code of the solution:

values_error.append("\\N")

titles = pd.read_csv("title.basics.tsv",
                     sep="\t",
                     dtype={
                         "runtimeMinutes": "Int64",
                     },
                     na_values={
                         "runtimeMinutes": values_error,
                     })
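
If you also want to know which rows hold these values, a possible approach is to read the column as plain strings and let pd.to_numeric with errors="coerce" flag whatever it cannot parse. This is only a minimal sketch: the three-row sample below is made up and stands in for title.basics.tsv; only the column name and the \N marker come from the question.

```python
import io

import pandas as pd

# Made-up sample standing in for title.basics.tsv (assumption, not real data).
SAMPLE = (
    "tconst\truntimeMinutes\n"
    "tt0000001\t45\n"
    "tt0000002\t\\N\n"
    "tt0000003\tReality-TV\n"
)

# Read everything as strings so nothing is parsed (or rejected) yet.
# keep_default_na=False leaves the "\N" marker as a literal string.
raw = pd.read_csv(io.StringIO(SAMPLE), sep="\t", dtype=str, keep_default_na=False)

# errors="coerce" turns every unparseable string into NaN instead of raising.
as_number = pd.to_numeric(raw["runtimeMinutes"], errors="coerce")

# Offending rows: not numeric and not the legitimate missing-value marker.
bad_rows = raw[as_number.isna() & (raw["runtimeMinutes"] != "\\N")]
print(bad_rows)
```

The index of bad_rows counts data rows from 0, so, assuming no field spans several lines, the line in the file would be the index plus 2 (one for the header, one for 1-based counting).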

3

Knowing the error, you can read the file line by line to find at least the first offending row, for instance:

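import csv

# The default csv dialect treats " as a quote character, like pandas does,
# so the reader reproduces the column shift that pandas runs into.
with open("title.basics.tsv", encoding="utf-8", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row
    for row_index, row in enumerate(reader):
        try:
            rtm = row[7]  # runtimeMinutes is the 8th column
            if rtm != "\\N":
                int(rtm)
        except ValueError:
            print(row_index, row[7], row)
            break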

Running it on my Raspberry Pi 5:

bruno@raspberrypi:/tmp $ python p.py  
1096569 Reality-TV ['tt10233364', 'tvEpisode', 'Rolling in the Deep Dish\tRolling in the Deep Dish', '0', '2019', '\\N', '\\N', 'Reality-TV']
bruno@raspberrypi:/tmp $ 

The offending line is in fact:

tvEpisode   "Rolling in the Deep Dish   "Rolling in the Deep Dish   0   2019    \N  \N  Reality-TV

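and the problem is the presence of the " characters. A field that starts with " is parsed as a quoted field, so the tab between the two titles is swallowed, both titles end up in one field, and every following column shifts left by one. That is how the genre Reality-TV lands in the runtimeMinutes column.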

Of course, by removing the break you can see where all 607 offending lines in your file are:

1505534 Talk-Show ['tt10970874', 'tvEpisode', 'Die Bauhaus-Stadt Tel Aviv - Vorbild für die Metropolen der Moderne?\tDie Bauhaus-Stadt Tel Aviv - Vorbild für die Metropolen der Moderne?', '0', '2019', '\\N', '\\N', 'Talk-Show']
1891927 Documentary ['tt11670006', 'tvEpisode', '...ein angenehmer Unbequemer...\t...ein angenehmer Unbequemer...', '0', '1981', '\\N', '30', 'Documentary']
2002681 Talk-Show ['tt11868642', 'tvEpisode', 'GGN Heavyweight Championship Lungs With Mike Tyson and Snoop\tGGN Heavyweight Championship Lungs With Mike Tyson and Snoop', '0', '2020', '\\N', '\\N', 'Talk-Show']
2155705 Family,Game-Show ['tt12149332', 'tvEpisode', 'Jeopardy! College Championship Semifinal Game 3\tJeopardy! College Championship Semifinal Game 3', '0', '2020', '\\N', '45', 'Family,Game-Show']
2300253 Reality-TV ['tt12415330', 'tvEpisode', 'Anthony Davis High Brow Tank\tAnthony Davis High Brow Tank', '0', '2017', '\\N', '\\N', 'Reality-TV']
6440150 News,Talk-Show ['tt27147391', 'tvEpisode', 'LATINO Accents QUIZ! w@MrHReviews @EchoBaseNetwork & Romi Dias The Latino Slant\tLATINO Accents QUIZ! w@MrHReviews @EchoBaseNetwork & Romi Dias The Latino Slant', '0', '2023', '\\N', '\\N', 'News,Talk-Show']
6492883 Documentary ['tt27404292', 'tvEpisode', 'Nord-Koreas röda prinsessa\tNord-Koreas röda prinsessa', '0', '2022', '\\N', '\\N', 'Documentary']
6529482 Talk-Show ['tt27493617', 'tvEpisode', 'War Room Round Table: Building an AI Networking Tool That Keeps Your Contacts Organized and Secure: Insights from Zach Hamed of Clay\tWar Room Round Table: Building an AI Networking Tool That Keeps Your Contacts Organized and Secure: Insights from Zach Hamed of Clay', '0', '2023', '\\N', '\\N', 'Talk-Show']
6529516 Talk-Show ['tt27493772', 'tvEpisode', 'War Room Round Table: The 1 Year Anniversary Edition of the Strategic Advisor Board Podcast\tWar Room Round Table: The 1 Year Anniversary Edition of the Strategic Advisor Board Podcast', '0', '2023', '\\N', '\\N', 'Talk-Show']
6596901 Comedy,News,Talk-Show ['tt27675642', 'tvEpisode', "It's not our fault! Bud Light boss LIES about boycott! Tries to get out of Dylan Mulvaney BACKLASH!\tIt's not our fault! Bud Light boss LIES about boycott! Tries to get out of Dylan Mulvaney BACKLASH!", '0', '2023', '\\N', '\\N', 'Comedy,News,Talk-Show']
...
8693717 Comedy,Horror,Mystery ['tt36220949', 'tvEpisode', 'War of the Colossal Beast\tWar of the Colossal Beast', '0', '1969', '\\N', '\\N', 'Comedy,Horror,Mystery']
8693745 Comedy,Horror,Mystery ['tt36220980', 'tvEpisode', 'Circus of Horrors\tCircus of Horrors', '0', '1969', '\\N', '\\N', 'Comedy,Horror,Mystery']
8693748 Comedy,Horror,Mystery ['tt36220983', 'tvEpisode', 'Ghost of Dragstrip Hollow\tGhost of Dragstrip Hollow', '0', '1969', '\\N', '\\N', 'Comedy,Horror,Mystery']
8693898 Comedy,Horror,Mystery ['tt36221179', 'tvEpisode', 'The Day the Earth Froze\tThe Day the Earth Froze', '0', '1970', '\\N', '\\N', 'Comedy,Horror,Mystery']
8693914 Comedy,Horror,Mystery ['tt36221195', 'tvEpisode', 'Black Sunday\tBlack Sunday', '0', '1970', '\\N', '\\N', 'Comedy,Horror,Mystery']
8693918 Comedy,Horror,Mystery ['tt36221199', 'tvEpisode', 'Battle Beyond the Sun\tBattle Beyond the Sun', '0', '1970', '\\N', '\\N', 'Comedy,Horror,Mystery']
9059783 Reality-TV ['tt37812510', 'tvEpisode', "Magnolia's Guide to Hot Air Ballooning\tMagnolia's Guide to Hot Air Ballooning", '0', '2025', '\\N', '\\N', 'Reality-TV']
9181849 Game-Show,Reality-TV ['tt3984412', 'tvEpisode', "I'm Not Going to Come Last, I'm Just Going to Die on The Amazing Race\tI'm Not Going to Come Last, I'm Just Going to Die on The Amazing Race", '0', '2014', '\\N', '\\N', 'Game-Show,Reality-TV']
11809510 Talk-Show ['tt9822816', 'tvEpisode', 'Zwischen Vertuschung und Aufklärung - Missbrauchsgipfel im Vatikan\tZwischen Vertuschung und Aufklärung - Missbrauchsgipfel im Vatikan', '0', '2019', '\\N', '\\N', 'Talk-Show']
11849253 Talk-Show ['tt9909210', 'tvEpisode', 'Politik und/oder Moral - Wie weit geht das Vertrauen der Bürger?\tPolitik und/oder Moral - Wie weit geht das Vertrauen der Bürger?', '0', '2005', '\\N', '\\N', 'Talk-Show']
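
Since the cause is the quote handling, one possible fix is to switch it off when loading the file with pandas by passing quoting=csv.QUOTE_NONE to read_csv. The sketch below assumes that the file never uses " to enclose fields, which I have not verified for the whole dataset. The first data row is made up; the second is the offending row shown above.

```python
import csv
import io

import pandas as pd

HEADER = ("tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
          "startYear\tendYear\truntimeMinutes\tgenres\n")
SAMPLE = (
    HEADER
    # Made-up row for illustration (assumption, not real file contents).
    + "tt0000001\tshort\tExample Title\tExample Title\t0\t1900\t\\N\t1\tShort\n"
    # The row reported by the script above, with its unbalanced quotes.
    + 'tt10233364\ttvEpisode\t"Rolling in the Deep Dish\t"Rolling in the Deep Dish'
      "\t0\t2019\t\\N\t\\N\tReality-TV\n"
)


def read_titles(source, **kwargs):
    """Load the table with runtimeMinutes as nullable integers."""
    return pd.read_csv(source,
                       sep="\t",
                       dtype={"runtimeMinutes": "Int64"},
                       na_values={"runtimeMinutes": ["\\N"]},
                       **kwargs)


# Default quoting: the quoted title swallows a tab and the columns shift.
try:
    read_titles(io.StringIO(SAMPLE))
except ValueError as exc:
    print("default quoting fails:", exc)

# QUOTE_NONE: the " is kept as an ordinary character of the title.
titles = read_titles(io.StringIO(SAMPLE), quoting=csv.QUOTE_NONE)
print(titles[["tconst", "primaryTitle", "runtimeMinutes", "genres"]])
```

With quoting disabled the titles keep their leading ", which you may want to strip afterwards.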
