
I have got multiple csv files which look like this:

ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15

I am currently importing the data using the pandas read_csv() function.

df = pd.read_csv(filename, sep = ',', quotechar='"')

This works for the first two rows in my csv file; unfortunately, I get an error in row 3. The reason is that within the 'Text' column there is a quote character followed by a comma before the end of the column.

ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4
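For reference, the failure can be reproduced without a file by feeding the same rows to read_csv through io.StringIO:

import io
import pandas as pd

broken = io.StringIO(
    'ID,Text,Value\n'
    '1,"I play football",10\n'
    '2,"I am hungry",12\n'
    '3,"Unfortunately",I get an error",15\n'
)

try:
    pd.read_csv(broken, sep=',', quotechar='"')
except pd.errors.ParserError as e:
    # prints the "Expected 3 fields in line 4, saw 4" error shown above
    print(e)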

Is there a way to solve this issue?

Expected output:

ID  Text                            Value
1   I play football                 10
2   I am hungry                     12
3   Unfortunately, I get an error   15
  • It will be rejected by pandas read_csv; so why be complicated? Edit the CSV files either manually or write Python code to do it by reading in lines and changing them as required. Commented Dec 28, 2022 at 14:42
  • You need to use a different separator. Commented Dec 28, 2022 at 14:46
  • Thanks for your answer. Manually editing is not an option; this is a repetitive weekly action for 100+ csv files. How can I read in lines and change them as required automatically? Commented Dec 28, 2022 at 14:46
  • Fix the code generating the 100+ CSV files so it doesn't generate invalid CSVs. Commented Dec 29, 2022 at 1:20

3 Answers


You can try to fix the CSV using the re module:

import re
import pandas as pd
from io import StringIO

with open("your_file.csv", "r") as f_in:
    s = re.sub(
        r'"(.*)"',
        lambda g: '"' + g.group(1).replace('"', "\\") + '"',
        f_in.read(),
    )

df = pd.read_csv(StringIO(s), sep=r",", quotechar='"', escapechar="\\")
print(df)

Prints:

   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15
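
Since the question mentions 100+ files each week, the same fix can be wrapped in a loop over a directory. A rough sketch (adjust the glob pattern and what you do with each DataFrame):

import glob
import re
import pandas as pd
from io import StringIO

for path in glob.glob("*.csv"):  # assumed: all files sit in the working directory
    with open(path, "r") as f_in:
        s = re.sub(
            r'"(.*)"',
            lambda g: '"' + g.group(1).replace('"', "\\") + '"',
            f_in.read(),
        )
    df = pd.read_csv(StringIO(s), sep=",", quotechar='"', escapechar="\\")
    # ... process or collect df here ...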

One (not so flexible) approach would be to first remove all " quotes from the csv and then enclose the Text column in "" quotes (so that the "," inside the text is not misinterpreted as a separator while parsing), like this:

# Assumes the layout from the question: the first field is the ID, the last
# field is the Value, and everything in between belongs to the Text column

# Open the input CSV file
with open('input.csv', 'r') as f:
    # Open the output CSV file
    with open('output.csv', 'w') as g:
        # Iterate through the raw lines of the input CSV file
        for line in f:
            # Remove all " characters from the line
            line = line.rstrip('\n').replace('"', '')
            if not line:
                continue  # skip blank lines
            # ID is everything before the first comma, Value everything after the last one
            id_part, rest = line.split(',', 1)
            text, value = rest.rsplit(',', 1)
            # Enclose the Text column in "" quotes and write the row
            g.write(f'{id_part},"{text}",{value}\n')

This code writes a new, corrected csv file.

Your problematic csv row will then look like this: 3,"Unfortunately,I get an error",15

Then you can import the data as you did before: df = pd.read_csv(filename, sep = ',', quotechar='"')

To automate this conversion for all csv files within a directory:

import glob

# Get a list of all CSV files in the current directory
csv_files = glob.glob('*.csv')

# Iterate through the CSV files
for csv_file in csv_files:
    # Name the output file after the input file
    output_file = csv_file.replace('.csv', '_new.csv')

    # Open the input CSV file
    with open(csv_file, 'r') as f:
        # Open the output CSV file
        with open(output_file, 'w') as g:
            # Iterate through the raw lines of the input CSV file
            for line in f:
                # Remove all " characters from the line
                line = line.rstrip('\n').replace('"', '')
                if not line:
                    continue  # skip blank lines
                # ID is everything before the first comma, Value everything after the last one
                id_part, rest = line.split(',', 1)
                text, value = rest.rsplit(',', 1)
                # Enclose the Text column in "" quotes and write the row
                g.write(f'{id_part},"{text}",{value}\n')

This names the new csv files like the old ones, but ending in "_new.csv" instead of just ".csv".
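
If you then want all converted files in a single DataFrame, a small follow-up could look like this (assuming the "_new.csv" files are in the working directory):

import glob
import pandas as pd

# Read every converted file and stack them into one DataFrame
frames = [pd.read_csv(path, sep=',', quotechar='"') for path in glob.glob('*_new.csv')]
combined = pd.concat(frames, ignore_index=True)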


A possible solution:

df = pd.read_csv(filename, sep=r'(?<=\d),|,(?=\d)', engine='python')  # split only on commas adjacent to a digit
df = df.reset_index().set_axis(['ID', 'Text', 'Value'], axis=1)  # rebuild the column names
df['Text'] = df['Text'].replace('"', '', regex=True)  # drop the leftover quotes
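
To see what the separator does, here it is applied with re.split: it only splits on commas that have a digit on at least one side, which is also why the header row (no digits) stays as a single field and the column names have to be rebuilt with set_axis:

import re

print(re.split(r'(?<=\d),|,(?=\d)', '3,"Unfortunately",I get an error",15'))
# ['3', '"Unfortunately",I get an error"', '15']

print(re.split(r'(?<=\d),|,(?=\d)', 'ID,Text,Value'))
# ['ID,Text,Value']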

Another possible solution:

from io import StringIO  # here, `text` holds the raw CSV content as a single string

df = pd.read_csv(StringIO(text), sep='\t')  # read each line as one field (no tabs in the data)
df[['ID', 'Text']] = df.iloc[:, 0].str.split(',', expand=True, n=1)
df[['Text', 'Value']] = df['Text'].str.rsplit(',', expand=True, n=1)
df = df.drop(df.columns[0], axis=1).assign(
    Text=df['Text'].replace('"', '', regex=True))

Output:

   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15
