1

I have imported data from Yfinance using

historicaldata = pdr.get_data_yahoo(tickers, period='1mo', interval='5m', prepost=True, group_by='ticker') historicaldata.to_csv('./store/historicaldata.csv')

which creates the following csv

historicaldata.csv

What I now want to do is pull in the "M6A=F" & "Volume" column and do a sum on the total volume but I am running into a problem trying to get at that column from the csv file using pandas.

If I try to load the columns back in using

voldata=pd.read_csv('./store/historicaldata.csv', usecols=['M6A=F'])

voldata.to_csv('volumetest.csv')

it only brings in the first "M6A=F" column it finds (the one with "Open" in the second row in the previous image) and no amount of filtering attempts helped me get to the M6A=F volume column.

So I did a test output using

voldata=pd.read_csv('./store/historicaldata.csv')

voldata.to_csv('volumetest.csv')

And discovered it creates the following csv where it has changed all the columns to unique data which makes me think it does that when loading it in using pd.read_csv.

enter image description here

(incidentally, I need to do this from the csv to avoid repeatedly pulling large amounts off yfinance every time I want to work on historical data)

What is the correct way for me to get at the "M6A=F" "Volume" column from a csv?

1 Answer 1

2

Since you're reading a csv-file that was produced from a dataframe with multi-index columns, you have to take that into account when reading it back into a dataframe.

Try something like the following:

from pandas_datareader import data as pdr
import yfinance as yf
yf.pdr_override()

tickers = "MYM=F M6A=F"
hist_data = pdr.get_data_yahoo(
    tickers, period="1mo", interval="5m", prepost=True, group_by="ticker"
)

# Writing to csv-file
hist_data.to_csv("hist_data.csv")

# Reading back from csv-file
hist_data = pd.read_csv("hist_data.csv", index_col=0, header=[0, 1])

# Selecting the M6A=F/Volume-column:
volume = hist_data[[("M6A=F", "Volume")]]
print(volume)

The first change is to set an index column by using index_col=0 (obviously the first here). And the second, header=[0, 1], is to make sure that the first 2 rows are used to build the multi-index columns. See here:

header : int, list of int, None, default ‘infer’

... The header can be a list of integers that specify row locations for a multi-index on the columns ...

Result:

                           M6A=F
                          Volume
Datetime                        
2022-06-06 09:40:00-04:00    0.0
2022-06-06 09:45:00-04:00   67.0
2022-06-06 09:50:00-04:00   36.0
2022-06-06 09:55:00-04:00   18.0
2022-06-06 10:00:00-04:00   61.0
...                          ...
2022-07-06 09:20:00-04:00   47.0
2022-07-06 09:25:00-04:00   12.0
2022-07-06 09:30:00-04:00    7.0
2022-07-06 09:31:10-04:00    0.0
2022-07-06 09:31:20-04:00    NaN

[6034 rows x 1 columns]

(I've used double brackets here hist_data[[("M6A=F", "Volume")]] to get a dataframe that shows the column label. If you don't need that, use single brackets hist_data[("M6A=F", "Volume")] etc.)

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.