How to best use read_csv parameters when headers are on different rows, and then make 1st column datetime index

Question

I've been having trouble reading and updating a csv from yfinance because of the data in the first few rows of the downloaded csv:

1st row contains the column headers I want (plus one header, 'Price', that I don't want)
2nd row is junk
3rd row has what I want to be the index header

The downloaded csv (formatted) looks like this:

Price	Adj Close	Close	High	Low	Open	Volume
Ticker	^BVSP	^BVSP	^BVSP	^BVSP	^BVSP	^BVSP
Date
2014-01-02	50341.0	50341.0	51656.0	50246.0	51522.0	3476300
2014-01-03	50981.0	50981.0	50981.0	50269.0	50348.0	7360400
2014-01-06	50974.0	50974.0	51002.0	50451.0	50980.0	3727800
2014-01-07	50430.0	50430.0	51478.0	50429.0	50982.0	3339500

The raw .csv file looks like this:

Price,Adj Close,Close,High,Low,Open,Volume
Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
Date,,,,,,,
2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500

Once read, I want the df to look like this, where 'Date' is datetime index:

Date	Adj Close	Close	High	Low	Open	Volume
2014-01-02	50341.0	50341.0	51656.0	50246.0	51522.0	3476300
2014-01-03	50981.0	50981.0	50981.0	50269.0	50348.0	7360400
2014-01-06	50974.0	50974.0	51002.0	50451.0	50980.0	3727800
2014-01-07	50430.0	50430.0	51478.0	50429.0	50982.0	3339500

I'm using this code, which works, but it seems clumsy.

idx_df = pd.read_csv(
    f'{data_folder}/INDEX_{idx_code}.csv',
    header=None,
    skiprows=3,  # data starts on row 4
    names=['Date', 'Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume'],
    index_col='Date'
)
idx_df.index = pd.to_datetime(idx_df.index, errors='coerce')

My questions:

Is there a simpler/more elegant way, perhaps using one line of code, and using the "header" parameter, even though "Date" is at position (2, 0) and the others are at (0, 1:6)?
Is there a way to set the index as datetime within the "read_csv" instruction, avoiding the "idx_df.index =" line?

Thanks

Didn't you ask the same question yesterday?
– Barmar, Commented Dec 10, 2024 at 20:12

Yes it was removed for some reason. Wrong format I guess.
– AndysPythonStuff, Commented Dec 10, 2024 at 20:57

Panda Kim · Accepted Answer · 2024-12-10 20:23:13Z

Example

import pandas as pd
import io

csv1 = '''Price,Adj Close,Close,High,Low,Open,Volume
Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
Date,,,,,,,
2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500
'''

Code

Use the skiprows, parse_dates, and index_col parameters:

df = pd.read_csv(
    io.StringIO(csv1), # file path
    skiprows=[1, 2], # skip junk rows
    parse_dates=['Price'], # convert Price column to datetime 
    index_col=0 # set Price column as index
).rename_axis('Date') # rename Price -> Date

df

            Adj Close    Close     High      Low     Open   Volume
Date                                                              
2014-01-02    50341.0  50341.0  51656.0  50246.0  51522.0  3476300
2014-01-03    50981.0  50981.0  50981.0  50269.0  50348.0  7360400
2014-01-06    50974.0  50974.0  51002.0  50451.0  50980.0  3727800
2014-01-07    50430.0  50430.0  51478.0  50429.0  50982.0  3339500
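A variant that avoids naming the unwanted 'Price' header is to pass parse_dates=True, which asks pandas to try parsing the index column as dates. The following is only a minimal sketch on the same sample data; the variable names raw_csv and df_alt are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Same sample data as in the question (raw_csv is an illustrative name).
raw_csv = """Price,Adj Close,Close,High,Low,Open,Volume
Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
Date,,,,,,,
2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500
"""

df_alt = pd.read_csv(
    io.StringIO(raw_csv),  # replace with the real file path
    skiprows=[1, 2],       # zero-based line numbers of the Ticker and Date rows
    index_col=0,           # first column becomes the index
    parse_dates=True,      # try to parse the index as datetime
).rename_axis('Date')      # the index is named 'Price' after reading; rename it

# Quick checks that the result has the shape the question asks for.
print(type(df_alt.index).__name__)
print(df_alt.index.name)
print(list(df_alt.columns))
```

Checking the index type right after reading is a cheap way to notice if a future yfinance export changes the date format and the parse silently falls back to strings.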

PaulS · Accepted Answer · 2024-12-10 20:30:24Z

A possible solution, whose steps are:

read_csv to load a CSV file named file.csv into a dataframe, skipping the rows at zero-based positions 1 and 2 (the Ticker and Date rows) using the skiprows parameter.
rename to change the name of the column Price to Date.

pd.read_csv('file.csv', skiprows=[1, 2]).rename({'Price': 'Date'}, axis=1)

Output:

         Date  Adj Close    Close     High      Low     Open   Volume
0  2014-01-02    50341.0  50341.0  51656.0  50246.0  51522.0  3476300
1  2014-01-03    50981.0  50981.0  50981.0  50269.0  50348.0  7360400
2  2014-01-06    50974.0  50974.0  51002.0  50451.0  50980.0  3727800
3  2014-01-07    50430.0  50430.0  51478.0  50429.0  50982.0  3339500
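In the output above, Date is still an ordinary column next to a default integer index, whereas the question asks for a datetime index. One possible extension, offered as a sketch rather than as part of this answer, is to parse the column while reading and then call set_index. The names sample_csv and df_idx are illustrative, and the in-memory string stands in for file.csv.

```python
import io

import pandas as pd

# Stand-in for file.csv (sample_csv is an illustrative name).
sample_csv = """Price,Adj Close,Close,High,Low,Open,Volume
Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
Date,,,,,,,
2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500
"""

df_idx = (
    pd.read_csv(
        io.StringIO(sample_csv),
        skiprows=[1, 2],         # drop the Ticker and Date rows (zero-based)
        parse_dates=['Price'],   # the date values sit under the 'Price' header
    )
    .rename(columns={'Price': 'Date'})  # give the column its intended name
    .set_index('Date')                  # promote it to the index
)

print(df_idx.index.dtype)
```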

edited Dec 10, 2024 at 20:30

answered Dec 10, 2024 at 20:11

PaulS
