
I am trying to make Python read an Excel file, create dataframes from .csv files that are named after rows in the Excel file, then look up matching data from those .csv files and write it back into the Excel file.

The Excel file has been read into a dataframe with the following layout:

     Name  Location      Date Check_2  ...  Volume  VWAP  $Volume  Trades
0  Orange  New York  20200501       X  ...     NaN   NaN      NaN     NaN
1   Apple     Minsk  20200504       X  ...     NaN   NaN      NaN     NaN

The empty cells should be filled with data looked up from the .csv files, which have been read into a dataframe that looks like this:

  Name      Date      Time  Open  High   Low  Close  Volume  VWAP  Trades
4   Orange  20200501  15:30:00  5.50  5.85  5.45   5.70    1500  5.73      95
5   Orange  20200501  17:00:00  5.65  5.70  5.50   5.60    1600  5.65      54
6   Orange  20200501  20:00:00  5.80  5.85  5.45   5.81    1700  5.73      41
7   Orange  20200501  22:00:00  5.60  5.84  5.45   5.65    1800  5.75      62
8   Orange  20200504  15:30:00  5.40  5.87  5.45   5.75    1900  5.83      84
9   Orange  20200504  17:00:00  5.50  5.75  5.40   5.60    2000  5.72      94
10  Orange  20200504  20:00:00  5.80  5.83  5.44   5.50    2100  5.40      55
11  Orange  20200504  22:00:00  5.40  5.58  5.37   5.80    2200  5.35      87
0    Apple  20200504  15:30:00  3.70  3.97  3.65   3.75    1000  3.60      55
1    Apple  20200504  17:00:00  3.65  3.95  3.50   3.80    1200  3.65      68
2    Apple  20200504  20:00:00  3.50  3.83  3.44   3.60    1300  3.73      71
3    Apple  20200504  22:00:00  3.55  3.58  3.35   3.57    1400  3.78      81
4    Apple  20200505  15:30:00  3.50  3.85  3.45   3.70    1500  3.73      95
5    Apple  20200505  17:00:00  3.65  3.70  3.50   3.60    1600  3.65      54
6    Apple  20200505  20:00:00  3.80  3.85  3.45   3.81    1700  3.73      41
7    Apple  20200505  22:00:00  3.60  3.84  3.45   3.65    1800  3.75      62

I have been struggling with filling these empty cells, because I haven't been able to find a way to properly index match across these 2 dataframes.

For example, trying:

intradayho = rdf2[(rdf2['Time']=='15:30:00')]
indexopen = pd.DataFrame(intradayho['Open'])

rdf1['Open'] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
print("Open prices rdf1")
print(rdf1['Open'])

produces:

Open prices rdf1
0    5.5
1    3.7

but this only takes the date into account: it copies the Open value matched on 'Date' alone, not on 'Name' and 'Date' together, and those are the two values that need to be matched.
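
I assume the lookup needs to be keyed on both columns, so something along these lines is probably what I need (just a sketch of the idea, not code I have working):

#keys are (Name, Date) tuples, so build the same tuples on the Excel side
open_lookup = intradayho.set_index(['Name', 'Date'])['Open'].to_dict()
rdf1['Open'] = rdf1[['Name', 'Date']].apply(tuple, axis=1).map(open_lookup)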

Also, this code produces the following warning:

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

but when I try to fix that with

rdf1.loc[rdf1['Open']] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())

I get an error:

KeyError: "None of [Float64Index([nan, nan], dtype='float64')] are in the [index]"

Which doesn't make sense to me, because the whole goal is to fill these 'NaN' values.
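
If I understand the warning correctly, .loc wants a row indexer and a column label rather than the column's values, so I suspect the intended pattern is something like this (again only a sketch, and it still matches on 'Date' alone):

#rows selected by a boolean mask, column selected by its label
open_map = intradayho.set_index('Date')['Open'].to_dict()
rdf1.loc[rdf1['Open'].isna(), 'Open'] = rdf1['Date'].map(open_map)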

Can someone here help me out with making something that can index match data from these dataframes and write it to the Excel file?

Thanks!

EDIT: Forgot to post my full code, here it is:

import pandas as pd
import os

#Opening 'Test Tracker.xlsx' to find entities to download
TEST = pd.ExcelFile("Trackers\TEST Tracker.xlsx")
df1 = TEST.parse("Entries")

values1 = df1[['Name', 'Location', 'Date', 'Check_2',
               'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', '$Volume',
               'Trades']]

#Searching for every row that contains the value 'X' in the column 'Check_2'
rdf1 = values1[values1.Check_2.str.contains("X")]

#Printing dataframe to check
print("First Dataframe")
print(rdf1)

#creating a list for the class objects
Fruits = []

#Generating dataframes from classobjects
for idx, rows in rdf1.iterrows():
    fle = os.path.join('Entities', rows.Location, rows.Name, 'TwoHours.csv')
    col_list = ['Name', 'Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'Trades']
    df3 = pd.read_csv(fle, usecols=col_list, sep=";")
    Fruits.append(df3)

rdf2 = pd.concat(Fruits)
print("Printing Full Data Frame")
print(rdf2)

intradayh = rdf2[(rdf2['Time']>'15:30:00') & (rdf2['Time']<'22:00:00')]
intradayho = rdf2[(rdf2['Time']=='15:30:00')]
indexopen = pd.DataFrame(intradayho['Open'])
intradayhc = rdf2[(rdf2['Time']=='22:00:00')]
indexclose = pd.DataFrame(intradayhc['Close'])

rdf1.loc[rdf1['Open']] = rdf1.Date.map(intradayho.set_index('Date')['Open'].to_dict())
print("Open prices rdf1")
print(rdf1['Open'])

EDIT: Desired output as requested in the comments:

     Name  Location      Date  Open  High   Low  Close  Volume  VWAP ...
0  Orange  New York  20200501   5.5  5.95  5.45   5.65    6600  5.71 ...
1   Apple     Minsk  20200504   3.7  3.83  3.35   3.57    4900  3.69 ...

I am going for a 1-to-1 match for 'Open', the max value for 'High', the min value for 'Low', a 1-to-1 match for 'Close', the sum for 'Volume' and 'Trades', the average for 'VWAP', and the value of 'Volume * VWAP' for '$Volume'.

  • Appears like all you need to do is a simple merge. Struggling to understand your question though. Can you manually construct two or three rows of expected output and share? Commented May 26, 2020 at 20:13
  • Thank you for your response. Yes, I updated the post with the desired output Commented May 26, 2020 at 20:25
  • Nice, for the Open and High are you looking for a 1-to-1 match or a mean? Also, we can't see Apple in the second dataframe. Can you update it as well, please? One thing with the forum is that answers come quickly when people can see input and output. Commented May 26, 2020 at 20:30
  • Okay, thank you for the tip. I will update the post with the entire dataframe Commented May 26, 2020 at 20:37
  • Not entire dataframe! Just give a sample of what best presents the situation. Commented May 26, 2020 at 20:38

1 Answer


Here df is your NaN dataframe (the Excel one) and df1 is your bigger dataframe with all the data (the concatenated CSVs).

Use groupby together with .agg() to compute multiple aggregations over multiple columns:

df2 = df1.groupby(['Name', 'Date']).agg(
    Open=('Open', 'first'), Close=('Close', 'last'), High=('High', 'max'),
    Low=('Low', 'min'), Volume=('Volume', 'sum'), VWAP=('VWAP', 'mean')
).reset_index()
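
If you also need 'Trades' and '$Volume' as described in the question, you could add Trades=('Trades', 'sum') to the aggregation above and compute '$Volume' afterwards. A sketch, reading the question's 'Volume * VWAP' definition as aggregated Volume times average VWAP:

#'$Volume' per the question's definition: Volume * VWAP
df2['$Volume'] = df2['Volume'] * df2['VWAP']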

One way is then to do an inner merge and slice off the leftover, still-empty columns:

result = pd.merge(df2, df, how='inner', on=['Name', 'Date']).iloc[:,:-4]
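
If both frames still carry the same column names, pandas will disambiguate them with _x/_y suffixes during the merge. One way around that (a sketch, assuming the price columns in your NaN frame are still empty and can simply be dropped first):

#drop the still-empty columns from the NaN frame so the merge keeps
#only the aggregated values and no suffixed duplicates appear
empty_cols = ['Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', '$Volume', 'Trades']
result = pd.merge(df.drop(columns=empty_cols), df2, how='inner', on=['Name', 'Date'])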

Or, after the aggregation, use combine_first and drop the leftover NaN rows:

result = (df.set_index(['Name', 'Date'])
            .combine_first(df2.set_index(['Name', 'Date']))
            .reset_index())
result = result[result['Location'].notna()]

result
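
To get the filled-in values back into Excel (the question's end goal), one straightforward option is to write the result out with to_excel. A sketch, assuming it is acceptable to recreate the workbook rather than update the existing file in place:

#creates/overwrites the workbook with a single "Entries" sheet
result.to_excel(r"Trackers\TEST Tracker.xlsx", sheet_name="Entries", index=False)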



Comments

Thank you for your help. Unfortunately this does not reproduce your results in my code. It returns the final dataframe, but all the values are still NaN. I think the issue is that your code adjusts the column names for some reason, as my terminal returns the columns as "Open_x", "Close_x", "Open_y", .... Therefore I think pandas does not recognize them, but I am not sure. I hope you can help fix that issue
Oh, I now see that you updated it with something; I'll check that
The code should work. Just ensure you put in the right column names. In the agg function, remember the pattern is .agg(NewName=(ColumnName, function)); in this case I made NewName the same as ColumnName. Just check the column names and it will be alright
Unfortunately your second option didn't work out for me either. There are a few things I do not understand about your code, and that might be it. You said df = NaN dataframe, by which I think you mean the first dataframe (which is rdf1 for me) that represents the Excel file. Then you say that df2 = the big one, which I get. But in your code you seem to use 'df', 'df1' and 'df2', so I messed around with these but didn't get it to work. Also, you use result=result[k.notna()], but 'k' is not defined, and swapping it with 'result' or 'rdf1' doesn't work either
The 'output' for me is NaN in each column that was already empty, in case you were wondering