
I have a CSV file that I'm trying to read into a pandas DataFrame. I need to skip the first 19 rows, which are comments. The headers are in the 20th row and the data are in the subsequent rows. The only issue is that the header row starts with a '#', which shifts the headers over. The rest of the data are delimited by spaces. For some reason, using sep=r'#|\s+' introduces two 'Unnamed' columns into the dataset.

Raw Data Input (row number shown):
01|# comments...
02|# comments...
03|# comments...
.
.
.
19|# comments...
20|# Header1 Header2 Header3
21|Data1 Data2 Data3

Code:

df = pd.read_csv(df_path, skiprows=19, sep=r'#|\s+', engine='python', encoding='utf-8')

Output df:

Unnamed:0 Unnamed:1 Header1
Data1 Data2 Data3

Desired Output df:

Header1 Header2 Header3
Data1 Data2 Data3

How can I address the extra '#' in the header row without having this issue? I've also tried using

sep=r'[#|\s+]'
  • import pandas as pd

    # Define the path to your CSV file
    df_path = 'your_file.csv'

    # Read the CSV file, skipping the first 19 rows as comments
    df = pd.read_csv(df_path, skiprows=19, sep='\s+', engine='python', encoding='utf-8', comment='#')

    # Rename the columns by splitting the first row
    df.columns = df.columns.str.split().str[-1]

    # Print the DataFrame
    print(df)

    Commented Feb 17, 2024 at 20:20

2 Answers

1
# From OP
df = pd.read_csv(
    df_path, skiprows=19, sep=r"#|\s+", engine="python", encoding="utf-8"
)
# Get the column names, minus the unwanted ones
new_cols = df.columns[2:]
# Remove empty columns at the end of DataFrame
df = df.iloc[:, : len(df.columns) - 2]
# Rename the columns
df.columns = new_cols

This will result in:

  Header1 Header2 Header3
0   Data1   Data2   Data3
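
An alternative that avoids the extra columns altogether is to read the header line yourself and pass the names to pandas. This is only a minimal sketch, not the method used in the answer above. It assumes the layout described in the question (19 comment lines, a '#'-prefixed header on line 20, whitespace-delimited data). The helper name `read_hash_header`, its parameters, and the in-memory sample are hypothetical and exist only for illustration.

```python
import io
import itertools

import pandas as pd


def read_hash_header(source, n_comment_rows=19):
    """Read a whitespace-delimited table whose header line starts with '#'.

    Hypothetical helper: `source` is any seekable text file object and
    `n_comment_rows` is the number of comment lines before the header.
    """
    # Take the header line (line n_comment_rows + 1) and drop the leading '#'.
    header_line = next(itertools.islice(source, n_comment_rows, None))
    names = header_line.lstrip("#").split()
    # Rewind, then skip the comments plus the header and supply the names.
    source.seek(0)
    return pd.read_csv(
        source,
        skiprows=n_comment_rows + 1,
        sep=r"\s+",
        header=None,
        names=names,
    )


# Made-up sample mirroring the layout in the question.
sample = "# comments...\n" * 19 + "# Header1 Header2 Header3\nData1 Data2 Data3\n"
df = read_hash_header(io.StringIO(sample))
print(df)
```

For a file on disk, the same helper could be called as `with open(df_path, encoding="utf-8") as f: df = read_hash_header(f)`. Because '#' is never part of the separator, no 'Unnamed' columns are created and nothing has to be sliced off afterwards.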

3 Comments

This worked, thank you! I'm still confused as to why it would introduce two extra columns. Doesn't make sense to me.
@monicaJ That's because you're using "#" and "\s+" as separators, so "# " (with a space) will create two columns. If you use just "\s+", "#" will be the first column name.
For comparison, consider a line such as ,, (two commas) that you split on commas. Pandas will see three fields: one before the first comma, one between the two commas, and one after the last comma. The same thing happens in your case, except with a pound sign and a space instead of commas.
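
The splitting behaviour described in these comments can be reproduced with Python's own `re.split`, without pandas. This is only an illustration of the regex semantics; it assumes the pandas python engine splits each line in a comparable way, which matches the output reported in the question.

```python
import re

header_line = "# Header1 Header2 Header3"

# "#" and the space after it are each a separate match of the pattern, so
# two empty fields appear before the first real header name.
fields = re.split(r"#|\s+", header_line)
print(fields)  # ['', '', 'Header1', 'Header2', 'Header3']

# The comma analogy from the comment: two delimiters give three fields.
comma_fields = ",,".split(",")
print(comma_fields)  # ['', '', '']
```

The two empty strings at the front are the fields that pandas labels as 'Unnamed' columns.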
0
import pandas as pd

# Define the path to your CSV file
df_path = 'your_file.csv'

# Read the CSV file, skipping the first 19 rows as comments
df = pd.read_csv(df_path, skiprows=19, sep=r'\s+', engine='python', encoding='utf-8', comment='#')

# Keep only the last whitespace-separated token of each column name
df.columns = df.columns.str.split().str[-1]

# Print the DataFrame
print(df)

1 Comment

Using comment='#' removes the header row I need, so the first row of data ends up being used as the headers instead.
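
The effect described in this comment can be reproduced on a small in-memory sample. This is a sketch under assumptions: the sample text is made up to mirror the layout in the question, with a second data row added so that one row is left after the first is consumed as the header.

```python
import io

import pandas as pd

# Made-up sample: 19 comment lines, a '#' header, then two data rows.
sample = (
    "# comments...\n" * 19
    + "# Header1 Header2 Header3\n"
    + "Data1 Data2 Data3\n"
    + "Data4 Data5 Data6\n"
)

df = pd.read_csv(io.StringIO(sample), skiprows=19, sep=r"\s+", comment="#")

# The '#' header line is discarded as a comment, so the first data row
# is promoted to the header and only the second data row remains.
columns = list(df.columns)
print(columns)
print(len(df))
```

Since pandas ignores fully commented lines when looking for the header, `comment='#'` on its own cannot recover names from a header line that starts with '#'.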
