1

I'm trying to select the values that break the record high or low for each day. I'm comparing against a DataFrame that stores the record high and low for each day in two separate columns. The end goal is to plot the new record values as a scatterplot of (date, value) against a line graph of the old record values, using matplotlib.

Here's an example dataset.

import pandas as pd

new_data = {'Date': ['1/1/2015', '1/2/2015', '1/3/2015', '1/4/2015', '1/5/2015'],
        'new_low': [10, 25, 24, 21, 15],
        'new_high': [35, 37, 38, 55, 47]}


record_data = {'Day': ['1/1', '1/2', '1/3', '1/4', '1/5'],
           'record_low': [12, 28, 21, 25, 15],
           'record_high': [30, 40, 36, 57, 46]}

df_new = pd.DataFrame(new_data)
df_new.set_index('Date', inplace=True)

df_record = pd.DataFrame(record_data)
df_record.set_index('Day', inplace=True)

So it would look like this

           new_low   new_high (new_data)
Date            
1/1/2015     10         35
1/2/2015     25         37
1/3/2015     24         38
1/4/2015     21         55
1/5/2015     15         47


       record_low   record_high (record_data)
Day
1/1       12           30
1/2       28           40
1/3       21           36
1/4       25           57
1/5       15           46

I want the result to look something like this.

       Date  Record Value
0  1/1/2015            10
1  1/2/2015            25
2  1/4/2015            21
3  1/1/2015            35
4  1/3/2015            38
5  1/5/2015            47

Since I need to use the result with matplotlib to make a scatterplot, I will need a list of x-values and y-values to enter. My example result was a dataframe that I made, but it doesn't need to be. I could use two separate arrays or even a list of tuples that I could unzip into lists of x and y.

I feel like there should be some simple/elegant way to do this with mapping, but I'm not experienced enough to find it and I haven't been able to find an answer elsewhere.

I'm also having some issues with how to enter the record data with just a month and day as a datestamp, so I've just set them all to the same year. It works for my visualization, but I would rather not do that to the data.

1
  • The record value for 1/5/2015 should be 47 in your example. Commented Mar 21, 2018 at 19:41

2 Answers

1

Edited to address comments

This solution assumes the data is read in from files, and it avoids merging the two DataFrames to compare them (note the reindex step).

import pandas as pd

# Skip each file's header row and give both tables the same column names.
# df_record has Date in the format 'month/day'.
df_record = pd.read_csv('record_data.tsv', sep='\t', skiprows=1,
                        names=['Date', 'X', 'Y'], index_col='Date')

# df_new has Date in the format 'month/day/year'.
df_new = pd.read_csv('new_data.tsv', sep='\t', skiprows=1, names=['Date', 'X', 'Y'])
# Drop the year so the index matches df_record's 'month/day' labels.
df_new = df_new.set_index(df_new['Date'].apply(lambda x: "/".join(x.split('/')[:-1]))).drop('Date', axis=1)

# Align df_new to df_record's index; days missing from df_new become NaN rows.
df_new = df_new.reindex(df_record.index)

# Compare the columns (X is treated as the low, Y as the high).
tdfX = (df_new['X'] < df_record['X'])
tdfY = (df_new['Y'] > df_record['Y'])

# Keep only the values that set a new record.
df_plot = pd.concat([df_new.loc[tdfY[tdfY].index, 'Y'],
                     df_new.loc[tdfX[tdfX].index, 'X']]).to_frame('Record').reset_index()
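For reference, here is a rough sketch of the same reindex idea applied to the sample data from the question, using the question's column names instead of X and Y. The extra 1/6 row in the record table is made up to show what happens when a day is missing from the new data, and keeping the full date as a Date column is my own choice, not part of the answer above.

```python
import pandas as pd

# Sample data from the question; the 1/6 record row is a hypothetical extra day
df_new = pd.DataFrame(
    {"new_low": [10, 25, 24, 21, 15], "new_high": [35, 37, 38, 55, 47]},
    index=["1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015", "1/5/2015"],
)
df_record = pd.DataFrame(
    {"record_low": [12, 28, 21, 25, 15, 9], "record_high": [30, 40, 36, 57, 46, 50]},
    index=["1/1", "1/2", "1/3", "1/4", "1/5", "1/6"],
)

# Keep the full date as a column, then index by the "month/day" part
df_new = df_new.rename_axis("Date").reset_index()
df_new.index = df_new["Date"].str.rsplit("/", n=1).str[0].rename("Day")

# Align to the record index; days missing from df_new become NaN rows
aligned = df_new.reindex(df_record.index)

# Comparisons against NaN are False, so missing days are never flagged
is_low = aligned["new_low"] < df_record["record_low"]
is_high = aligned["new_high"] > df_record["record_high"]

lows = aligned.loc[is_low, ["Date", "new_low"]].rename(columns={"new_low": "Record"})
highs = aligned.loc[is_high, ["Date", "new_high"]].rename(columns={"new_high": "Record"})
df_plot = pd.concat([lows, highs], ignore_index=True)
```

Because the reindexed frame shares its labels with df_record, the element-wise comparison does not raise, and the values come back as floats since the NaN row forces a float column.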

7 Comments

I could probably handle just having the day and month for the index, but the actual problem that I'm dealing with has a dataframe that contains each day of the year, so it would be more helpful if you showed how to take the "dates" and change them into "days" since I'm reading it in from a file and not typing in each entry. Is the df_new['X'] < df_record['X'] equivalent to the df.X_x < df.X_y in @Alollz answer?
Ahh ok, assuming you are reading from a file for both data sets, I edited to include code to make DataFrames the same if desired. I assumed that your tables have headers- which you can change while reading the file.
Your method does work. I was wondering why you include date in df_plot instead of just using df_plot.reset_index(inplace = True) at the end. I don't know if it would simplify much, but it seems like reducing something in a for loop would be a good idea. I was also wondering why you used the t in tdfX. Is that a common formatting for a boolean map?
You need to be careful with this method... This works because your example DataFrames have perfect overlap with indices. If one has a date the other does not your comparisons will throw ValueError: Can only compare identically-labeled Series objects.
And given that Feb 29 only occurs every 4 years, it's likely that your record df contains Feb 29, but your new_df does not.
0

There's probably a better answer out there, but you could merge the two DataFrames together, and then determine if the df_new value is a record by comparing the columns.

I wouldn't set the dates as an index, just keep them as a column. It makes it a little bit nicer. If they are your indices, then do this first:

import pandas as pd
df_new['Date'] = df_new.index
df_record['Day'] = df_record.index

Then:

df_new['day'] = pd.to_datetime(df_new.Date).dt.day
df_new['month'] = pd.to_datetime(df_new.Date).dt.month

# %m is the month; %M would parse the first field as minutes
df_record['day'] = pd.to_datetime(df_record.Day, format='%m/%d').dt.day
df_record['month'] = pd.to_datetime(df_record.Day, format='%m/%d').dt.month
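If you would rather not attach a year to the record table at all, one possible alternative is to split the "month/day" strings into integers directly. Parsing them with to_datetime has to fill in some year, which can be a problem for 2/29. This is only a sketch with made-up rows (the leap-day record and the two new dates are hypothetical), not the method used in the rest of this answer.

```python
import pandas as pd

# Hypothetical record table that includes leap day
df_record = pd.DataFrame({"Day": ["2/28", "2/29", "3/1"], "record_low": [5, 7, 9]})
df_new = pd.DataFrame({"Date": ["2/28/2015", "3/1/2015"], "new_low": [4, 10]})

# Split "month/day" into integer columns without attaching a year
parts = df_record["Day"].str.split("/", expand=True).astype(int)
df_record["month"] = parts[0]
df_record["day"] = parts[1]

# Full dates can still go through the datetime accessor
dates = pd.to_datetime(df_new["Date"], format="%m/%d/%Y")
df_new["month"] = dates.dt.month
df_new["day"] = dates.dt.day

# The default inner merge keeps only days present in both tables,
# so the 2/29 record drops out in a non-leap year
merged = df_new.merge(df_record, on=["month", "day"])
```

The merge on month and day then works the same way as in the step that follows.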

Merge the DataFrames and drop the columns we no longer need:

df = df_new.merge(df_record, on=['month', 'day']).drop(columns=['month', 'day', 'Day'])

Then check if a value is a record. If so, create a new DataFrame with the record values:

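# merge() adds the suffixes: X_x/Y_x come from df_new, X_y/Y_y from df_record.
# This assumes both tables name their value columns X (low) and Y (high).
record_low = df.X_x < df.X_y    # boolean Series, compared element by element
record_high = df.Y_x > df.Y_y
pd.DataFrame({'Date': df[record_low]['Date'].tolist() + df[record_high]['Date'].tolist(), 
 'Record Value': df[record_low]['X_x'].tolist() + df[record_high]['Y_x'].tolist()})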

    Date    Record Value
0   1/1/2015    10
1   1/2/2015    25
2   1/4/2015    21
3   1/1/2015    35
4   1/3/2015    38
5   1/5/2015    47
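Since the question asks for x and y lists to hand to matplotlib, here is a sketch of the whole flow using the question's own column names. Note that it merges on a "month/day" string key rather than on separate month and day columns, which only works if both tables format the day the same way (no zero-padding mismatch). The plot_records helper is a hypothetical name, and the plot styling is just a guess at what you want.

```python
import pandas as pd

df_new = pd.DataFrame({
    "Date": ["1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015", "1/5/2015"],
    "new_low": [10, 25, 24, 21, 15],
    "new_high": [35, 37, 38, 55, 47],
})
df_record = pd.DataFrame({
    "Day": ["1/1", "1/2", "1/3", "1/4", "1/5"],
    "record_low": [12, 28, 21, 25, 15],
    "record_high": [30, 40, 36, 57, 46],
})

# Build a shared "month/day" key by trimming the year from the full date
df_new["Day"] = df_new["Date"].str.rsplit("/", n=1).str[0]
df = df_new.merge(df_record, on="Day")
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

low = df.loc[df["new_low"] < df["record_low"], ["Date", "new_low"]]
high = df.loc[df["new_high"] > df["record_high"], ["Date", "new_high"]]

# x and y sequences for the scatter plot
x = low["Date"].tolist() + high["Date"].tolist()
y = low["new_low"].tolist() + high["new_high"].tolist()


def plot_records(df, x, y):
    """Draw the old records as lines and the new records as points."""
    import matplotlib.pyplot as plt  # imported here so the data prep runs without it

    fig, ax = plt.subplots()
    ax.plot(df["Date"], df["record_low"], label="record low")
    ax.plot(df["Date"], df["record_high"], label="record high")
    ax.scatter(x, y, color="black", label="new record")
    ax.legend()
    return fig
```

Calling plot_records(df, x, y) returns the figure, which you can then show or save.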

6 Comments

What version of pandas are you using? Your first code block fails on version 0.22.0. This works in its place on my version: df_new['day'] = pd.to_datetime(df_new.index).day. The other lines would be similarly changed.
I'm using 0.22.0. That was why I was saying it's better to leave the dates as a column instead of setting them as an index. With them as an index you will lose the 'Date' and 'Day' information on the merge. So either don't set them as an index or at the top do df_new['Date'] = df_new.index and df_record['Day'] = df_record.index
Oops seems like I missed that part of your answer
No worries, I updated the answer that way you can run it with the OPs given DataFrame.
Thanks for the quick answer. I'm getting some issues with the 3rd block of code: "TypeError: drop() got an unexpected keyword argument 'columns'." I'm not sure why it's doing that. I was also wondering about the line record_low = df.X_x < df.X_y. It looks like you are making a map(?), but I haven't seen that notation before. Does the df.X_x just say to compare each element? And was this before the edit I made in the formatting?
