
I have got multiple csv files which look like this:

ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15

I am currently importing the data using the pandas read_csv() function.

df = pd.read_csv(filename, sep = ',', quotechar='"')

This works for the first two rows in my csv file; unfortunately, I get an error in row 3. The reason is that within the 'Text' column there is a quote character followed by a comma before the end of the column.

ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
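For reference, the failure can be reproduced without a file by feeding the same rows to read_csv through io.StringIO:

import io
import pandas as pd

broken = io.StringIO(
    'ID,Text,Value\n'
    '1,"I play football",10\n'
    '2,"I am hungry",12\n'
    '3,"Unfortunately",I get an error",15\n'
)

try:
    pd.read_csv(broken, sep=',', quotechar='"')
except pd.errors.ParserError as e:
    # prints the "Expected 3 fields in line 4, saw 4" error shown above
    print(e)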

Is there a way to solve this issue?

Expected output:

ID  Text                            Value
1   I play football                 10
2   I am hungry                     12
3   Unfortunately, I get an error   15
  • It will be rejected by pandas read_csv; so why be complicated? Edit the CSV files either manually or write Python code to do it by reading in lines and changing them as required. Commented Dec 28, 2022 at 14:42
  • You need to use a different separator. Commented Dec 28, 2022 at 14:46
  • Thanks for your answer. Manually editing is not an option; this is a repetitive weekly action for 100+ csv files. How can I read in lines and change them as required automatically? Commented Dec 28, 2022 at 14:46
  • Fix the code generating the 100+ CSV files so it doesn't generate invalid CSVs. Commented Dec 29, 2022 at 1:20

3 Answers


You can try to fix the CSV using the re module:

import re
import pandas as pd
from io import StringIO

with open("your_file.csv", "r") as f_in:
    s = re.sub(
        r'"(.*)"',
        lambda g: '"' + g.group(1).replace('"', "\\") + '"',
        f_in.read(),
    )

df = pd.read_csv(StringIO(s), sep=r",", quotechar='"', escapechar="\\")
print(df)

Prints:

   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15
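
Since the question mentions 100+ files each week, the same fix can be wrapped in a loop over a directory. A rough sketch (adjust the glob pattern and what you do with each DataFrame):

import glob
import re
import pandas as pd
from io import StringIO

for path in glob.glob("*.csv"):  # assumed: all files sit in the working directory
    with open(path, "r") as f_in:
        s = re.sub(
            r'"(.*)"',
            lambda g: '"' + g.group(1).replace('"', "\\") + '"',
            f_in.read(),
        )
    df = pd.read_csv(StringIO(s), sep=",", quotechar='"', escapechar="\\")
    # ... process or collect df here ...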

One (not so flexible) approach would be to first remove all " quotes from the csv and then enclose the Text column in "" quotes (so that the "," inside the text is not misinterpreted as a separator while parsing), like this:

# Assumes the layout from the question: the first field is the ID, the last
# field is the Value, and everything in between belongs to the Text column

# Open the input CSV file
with open('input.csv', 'r') as f:
    # Open the output CSV file
    with open('output.csv', 'w') as g:
        # Iterate through the raw lines of the input CSV file
        for line in f:
            # Remove all " characters from the line
            line = line.rstrip('\n').replace('"', '')
            if not line:
                continue  # skip blank lines
            # ID is everything before the first comma, Value everything after the last one
            id_part, rest = line.split(',', 1)
            text, value = rest.rsplit(',', 1)
            # Enclose the Text column in "" quotes and write the row
            g.write(f'{id_part},"{text}",{value}\n')

This code writes a new, corrected csv file.

Your problematic csv row will then look like this: 3,"Unfortunately,I get an error",15

Then you can import the data as you did before: df = pd.read_csv(filename, sep = ',', quotechar='"')

To automate this conversion for all csv files within a directory:

import glob

# Get a list of all CSV files in the current directory
csv_files = glob.glob('*.csv')

# Iterate through the CSV files
for csv_file in csv_files:
    # Name the output file after the input file
    output_file = csv_file.replace('.csv', '_new.csv')

    # Open the input CSV file
    with open(csv_file, 'r') as f:
        # Open the output CSV file
        with open(output_file, 'w') as g:
            # Iterate through the raw lines of the input CSV file
            for line in f:
                # Remove all " characters from the line
                line = line.rstrip('\n').replace('"', '')
                if not line:
                    continue  # skip blank lines
                # ID is everything before the first comma, Value everything after the last one
                id_part, rest = line.split(',', 1)
                text, value = rest.rsplit(',', 1)
                # Enclose the Text column in "" quotes and write the row
                g.write(f'{id_part},"{text}",{value}\n')

This names the new csv files like the old ones, but ending in "_new.csv" instead of just ".csv".
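
If you then want all converted files in a single DataFrame, a small follow-up could look like this (assuming the "_new.csv" files are in the working directory):

import glob
import pandas as pd

# Read every converted file and stack them into one DataFrame
frames = [pd.read_csv(path, sep=',', quotechar='"') for path in glob.glob('*_new.csv')]
combined = pd.concat(frames, ignore_index=True)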


A possible solution:

df = pd.read_csv(filename, sep=r'(?<=\d),|,(?=\d)', engine='python')  # split only on commas adjacent to a digit
df = df.reset_index().set_axis(['ID', 'Text', 'Value'], axis=1)  # rebuild the column names
df['Text'] = df['Text'].replace('"', '', regex=True)  # drop the leftover quotes
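
To see what the separator does, here it is applied with re.split: it only splits on commas that have a digit on at least one side, which is also why the header row (no digits) stays as a single field and the column names have to be rebuilt with set_axis:

import re

print(re.split(r'(?<=\d),|,(?=\d)', '3,"Unfortunately",I get an error",15'))
# ['3', '"Unfortunately",I get an error"', '15']

print(re.split(r'(?<=\d),|,(?=\d)', 'ID,Text,Value'))
# ['ID,Text,Value']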

Another possible solution:

from io import StringIO  # here, `text` holds the raw CSV content as a single string

df = pd.read_csv(StringIO(text), sep='\t')  # read each line as one field (no tabs in the data)
df[['ID', 'Text']] = df.iloc[:, 0].str.split(',', expand=True, n=1)
df[['Text', 'Value']] = df['Text'].str.rsplit(',', expand=True, n=1)
df = df.drop(df.columns[0], axis=1).assign(
    Text=df['Text'].replace('"', '', regex=True))

Output:

   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15
