Pandas read_csv() with multiple delimiters not working

Question

I have a CSV file I'm trying to read into a pandas DataFrame. I need to skip the first 19 rows of comments. The headers are in the 20th row and the data is in the subsequent rows. The only issue is that the header row starts with a '#', which shifts the headers over. The rest of the data is delimited with a space. For some reason, using sep=r'#|\s+' introduces two 'Unnamed' columns into the dataset.

Code:

df = pd.read_csv(df_path, skiprows=19, sep=r'#|\s+', engine='python', encoding='utf-8')

Output df:

Unnamed:0	Unnamed:1	Header1
Data1	Data2	Data3

Desired Output df:

Header1	Header2	Header3
Data1	Data2	Data3

How can I address the extra '#' in the header row without having this issue? I've also tried using

sep=r'[#|\s+]'


e-motta · Accepted Answer · 2024-02-17 20:48:32Z

1

# From OP
df = pd.read_csv(
    df_path, skiprows=19, sep=r"#|\s+", engine="python", encoding="utf-8"
)
# Get the column names, minus the unwanted ones
new_cols = df.columns[2:]
# Remove empty columns at the end of DataFrame
df = df.iloc[:, : len(df.columns) - 2]
# Rename the columns
df.columns = new_cols

This will result in:

  Header1 Header2 Header3
0   Data1   Data2   Data3
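A self-contained variation of the same idea, as a sketch only: instead of hard-coding 2, it counts the columns whose names start with "Unnamed" and shifts the header by that many positions. The in-memory `sample` text and the name `n_extra` are assumptions made for illustration; they are not part of the original answer.

```python
import io

import pandas as pd

# Hypothetical stand-in for the file after the 19 comment rows are skipped.
sample = "# Header1 Header2 Header3\nData1 Data2 Data3\n"

df = pd.read_csv(io.StringIO(sample), sep=r"#|\s+", engine="python")

# Count the placeholder columns created by the empty fields before Header1.
n_extra = sum(str(col).startswith("Unnamed") for col in df.columns)

# Real header names start after the placeholders.
new_cols = df.columns[n_extra:]
# The same number of all-empty columns sits at the right edge; drop them.
df = df.iloc[:, : len(df.columns) - n_extra]
df.columns = new_cols

print(df)
```

This assumes none of the real headers begins with "Unnamed"; if one might, the fixed offset from the answer is the safer choice.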

answered Feb 17, 2024 at 20:48

3 Comments

monicaJ Over a year ago

This worked, thank you! I'm still confused as to why it would introduce two extra columns. Doesn't make sense to me.

e-motta Over a year ago

@monicaJ That's because both "#" and "\s+" are separators, so the leading "# " (a hash followed by a space) is split twice and produces two empty fields, which become the two unnamed columns. If you use just "\s+", "#" will be the first column name.

topsail Over a year ago

For comparison, consider a line such as ",," (that is, two commas) that you split on commas. Pandas will see three fields! There is one field before the first comma, one between the two commas, and one after the last comma. Same thing for you, but you have pound and space instead of commas.
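The two comments above can be checked with plain `re.split`, outside of pandas. This is only an illustration of how a regex separator treats the leading "# "; the sample lines are made up to match the question, and it does not claim to reproduce pandas' internals exactly.

```python
import re

pattern = r"#|\s+"

header_line = "# Header1 Header2 Header3"
data_line = "Data1 Data2 Data3"

# '#' is a separator, and so is the space right after it, so two empty
# fields appear before the first real header name.
header_fields = re.split(pattern, header_line)
data_fields = re.split(pattern, data_line)

print(header_fields)  # ['', '', 'Header1', 'Header2', 'Header3']
print(data_fields)    # ['Data1', 'Data2', 'Data3']

# The comma analogy: two separators always delimit three fields.
comma_fields = ",,".split(",")
print(comma_fields)   # ['', '', '']
```

The header row ends up with five fields while each data row has three, which is why the first two columns come out unnamed and the data appears shifted under them.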

Atul sanwal · Answer · 2024-02-17 20:21:50Z

0

import pandas as pd

# Define the path to your CSV file
df_path = 'your_file.csv'

# Read the CSV file, skipping the first 19 rows as comments
df = pd.read_csv(df_path, skiprows=19, sep=r'\s+', engine='python', encoding='utf-8', comment='#')

# Rename the columns by splitting the first row
df.columns = df.columns.str.split().str[-1]

# Print the DataFrame
print(df)

answered Feb 17, 2024 at 20:21

1 Comment

monicaJ Over a year ago

Using comment = '#' eliminates the header rows I need from the dataframe so the first line ends up being the first row of data instead of headers.
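One way around the problem described in this comment, sketched under assumptions: read the header line yourself, strip the leading '#', and pass the resulting names to `read_csv` while skipping the header row as well. The helper name `read_with_hash_header`, its parameters, and the two-line comment block in `sample` are hypothetical; the real file would use 19 comment rows and a file path instead of an in-memory string.

```python
import io

import pandas as pd


def read_with_hash_header(text, n_comment_rows):
    """Parse whitespace-delimited text whose header row starts with '#'.

    Hypothetical helper for illustration, not part of pandas.
    """
    lines = text.splitlines()
    # The header is the first line after the comment block; drop the '#'.
    names = lines[n_comment_rows].lstrip("#").split()
    return pd.read_csv(
        io.StringIO(text),
        skiprows=n_comment_rows + 1,  # comment rows plus the header row
        sep=r"\s+",
        header=None,
        names=names,
    )


# Stand-in data: 2 comment rows here instead of the 19 in the question.
sample = (
    "comment line 1\n"
    "comment line 2\n"
    "# Header1 Header2 Header3\n"
    "Data1 Data2 Data3\n"
)

df = read_with_hash_header(sample, n_comment_rows=2)
print(df)
```

Because only whitespace is used as the separator, no empty fields are created and no columns need to be dropped afterwards.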
