Ive been having trouble reading and updating a csv from yfinance, due to the data in the first few rows of the downloaded csv:
1st row contains the column headers I want (also header - 'Price' - I dont want)
2nd row is junk
3rd row has what I want to be the index header
The downloaded csv (formatted) looks like this:
| Price | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| Ticker | ^BVSP | ^BVSP | ^BVSP | ^BVSP | ^BVSP | ^BVSP |
| Date | ||||||
| 2014-01-02 | 50341.0 | 50341.0 | 51656.0 | 50246.0 | 51522.0 | 3476300 |
| 2014-01-03 | 50981.0 | 50981.0 | 50981.0 | 50269.0 | 50348.0 | 7360400 |
| 2014-01-06 | 50974.0 | 50974.0 | 51002.0 | 50451.0 | 50980.0 | 3727800 |
| 2014-01-07 | 50430.0 | 50430.0 | 51478.0 | 50429.0 | 50982.0 | 3339500 |
The raw .csv file looks like this:
Price,Adj Close,Close,High,Low,Open,Volume
Ticker,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP,^BVSP
Date,,,,,,,
2014-01-02,50341.0,50341.0,51656.0,50246.0,51522.0,3476300
2014-01-03,50981.0,50981.0,50981.0,50269.0,50348.0,7360400
2014-01-06,50974.0,50974.0,51002.0,50451.0,50980.0,3727800
2014-01-07,50430.0,50430.0,51478.0,50429.0,50982.0,3339500
Once read, I want the df to look like this, where 'Date' is datetime index:
| Date | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| 2014-01-02 | 50341.0 | 50341.0 | 51656.0 | 50246.0 | 51522.0 | 3476300 |
| 2014-01-03 | 50981.0 | 50981.0 | 50981.0 | 50269.0 | 50348.0 | 7360400 |
| 2014-01-06 | 50974.0 | 50974.0 | 51002.0 | 50451.0 | 50980.0 | 3727800 |
| 2014-01-07 | 50430.0 | 50430.0 | 51478.0 | 50429.0 | 50982.0 | 3339500 |
I'm using this code, which works, but it seems clumsy.
idx_df = pd.read_csv(
f'{data_folder}/INDEX_{idx_code}.csv',
header=None,
skiprows=3, # data starts on row 4
names=['Date', 'Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume'],
index_col='Date'
)
idx_df.index = pd.to_datetime(idx_df.index, errors='coerce')
My questions:
- Is there a simpler/more elegant way, perhaps using one line of code, and using the "header" parameter, even though "Date" is at position (2, 0) and the others are at (0, 1:6)?
- Is there a way to set the index as datetime within the "read_csv" instruction, avoiding the "idx_df.index =" line?
Thanks