Pandas Dataframe Append function does not persist

Question

Following this Q&A, I have managed to concatenate several CSV files into one time-series dataframe, adding a column that holds the name of the CSV file each record came from, like so:

import os
import glob
import pandas as pd

path = ''

all_files = glob.glob(os.path.join(path, "*.csv")) 

# Base names of the same files (reuses all_files instead of a second glob with a Windows-only '\' separator)
names = [os.path.basename(x) for x in all_files]

df = pd.DataFrame()
for file_ in all_files:
    file_df = pd.read_csv(file_, sep=',', parse_dates=["capture_datetime_utc"], index_col="capture_datetime_utc")
    file_df['file_name'] = file_ 
    df = df.append(file_df)
df.shape

This seems to work fine and, as you can see in this Jupyter Notebook, I get a dataframe with 5 columns.

But then when I downsample this time-series df from 15-minute intervals to an hourly mean, like so:

df_h = df.resample('H').mean()
df_h.shape

I get a dataframe with only 4 columns.

So it seems the column I appended does not persist, and I need to make it persist. I have tried passing "inplace=True" to the append function itself (it threw an error) and also adding it after the append (it made no difference).

If anyone can show me how to make this appended column permanent, I'd be much obliged!

Your file_name column is being removed because it does not have a numerical dtype. See here: stackoverflow.com/a/34270422/8146556
– rahlf23, Commented Aug 30, 2018 at 18:24
Thanks for the pointer, @rahlf23; now I know why it's not working. But how can I convert the filename (a string, of necessity) to a numeric datatype, I wonder?
– Walt, Commented Aug 30, 2018 at 18:58
Why are you interested in retaining the filename? You are taking the mean of the data, therefore it's irrelevant.
– rahlf23, Commented Aug 30, 2018 at 19:00
I need to retain the filename so that I can tell which of 300 different soil sensors the data is coming from. Then I am downsampling from 15-minute to hourly intervals so that I can correlate data from the soil sensor with local weather data, which is logged at hourly intervals.
– Walt, Commented Aug 30, 2018 at 19:47
So you need to groupby sensor and THEN resample? In other words, you just want to take the mean individually by sensor?
– rahlf23, Commented Aug 30, 2018 at 19:55

rahlf23 · Accepted Answer · 2018-08-30 21:38:55Z

Your file_name column is being removed because it does not have a numerical dtype, so no mean can be computed for it. Also, since you are aggregating the whole dataframe via mean(), each hourly value can mix readings from several files, so the file_name of the original data source becomes meaningless after taking the mean across your concatenated dataframes. To keep it, you need to take the mean separately for each file.
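
To see the effect in isolation, here is a minimal sketch. The readings and the name sensor_a.csv are made up for illustration, and select_dtypes() is used here only to make the "numeric columns only" step explicit; older pandas versions dropped such columns silently inside mean(), while newer releases may raise an error instead.

```python
import pandas as pd

# Hypothetical 15-minute readings from a single sensor file
idx = pd.date_range("2018-07-30 17:00", periods=8, freq="15min")
df = pd.DataFrame(
    {
        "light": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
        "file_name": "sensor_a.csv",  # string column, has no mean
    },
    index=idx,
)

# Only numeric columns can be averaged, so the string column is left out
hourly = df.select_dtypes(include="number").resample("60min").mean()

print(df.columns.tolist())      # ['light', 'file_name']
print(hourly.columns.tolist())  # ['light']
```

Nothing about the append step was lost; the column is still in df and only the aggregated result lacks it.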

I would recommend using pd.concat() in place of df.append(). Given two sample csv files:

sample1.csv

capture_datetime_utc,fertilizer_level,light,soil_moisture_present,air_temperature_celsius
2018-07-30 17:34:33,-1.0,1.28,12.13,26.42
2018-07-30 17:49:33,-1.0,1.26,11.87,26.51
2018-07-30 18:04:33,-1.0,1.26,11.47,26.37
2018-07-30 18:19:33,-1.0,1.17,12.00,26.28
2018-07-30 18:34:33,-1.0,0.94,11.47,25.34

sample2.csv

capture_datetime_utc,fertilizer_level,light,soil_moisture_present,air_temperature_celsius
2018-08-28 07:50:23,-1.0,40.73,6.53,31.82
2018-08-28 08:05:23,-1.0,47.13,6.65,33.65
2018-08-28 08:20:23,-1.0,51.94,6.65,35.00
2018-08-28 08:35:23,-1.0,57.46,6.65,36.55
2018-08-28 08:50:23,-1.0,14.17,6.77,32.98

You can do the following:

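import pandas as pd

all_files = ['sample1.csv', 'sample2.csv']

# keys= labels each file's rows with the file name (as the outer index level)
df = pd.concat([pd.read_csv(file_, sep=',', parse_dates=["capture_datetime_utc"], index_col="capture_datetime_utc") for file_ in all_files], keys=all_files)

# reset_index() turns that unnamed level into a 'level_0' column, which is then used to group by file before resampling
df = df.reset_index().set_index('capture_datetime_utc').groupby('level_0').resample('H').mean().dropna()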

Which gives:

                                  fertilizer_level      light  \
level_0     capture_datetime_utc                                
sample1.csv 2018-07-30 17:00:00               -1.0   1.270000   
            2018-07-30 18:00:00               -1.0   1.123333   
sample2.csv 2018-08-28 07:00:00               -1.0  40.730000   
            2018-08-28 08:00:00               -1.0  42.675000   

                                  soil_moisture_present  \
level_0     capture_datetime_utc                          
sample1.csv 2018-07-30 17:00:00               12.000000   
            2018-07-30 18:00:00               11.646667   
sample2.csv 2018-08-28 07:00:00                6.530000   
            2018-08-28 08:00:00                6.680000   

                                  air_temperature_celsius  
level_0     capture_datetime_utc                           
sample1.csv 2018-07-30 17:00:00                 26.465000  
            2018-07-30 18:00:00                 25.996667  
sample2.csv 2018-08-28 07:00:00                 31.820000  
            2018-08-28 08:00:00                 34.545000
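
If you would rather end up with file_name as an ordinary column than as an index level, one possible variation is to group on the column together with a pd.Grouper for the timestamps. This is only a sketch, not the approach above: the in-memory csv_texts dictionary is a stand-in for reading real files, and it uses a shortened subset of the sample rows and columns.

```python
import io
import pandas as pd

# Stand-ins for the CSV files (subset of the sample rows and columns above)
csv_texts = {
    "sample1.csv": (
        "capture_datetime_utc,light,air_temperature_celsius\n"
        "2018-07-30 17:34:33,1.28,26.42\n"
        "2018-07-30 17:49:33,1.26,26.51\n"
        "2018-07-30 18:04:33,1.26,26.37\n"
    ),
    "sample2.csv": (
        "capture_datetime_utc,light,air_temperature_celsius\n"
        "2018-08-28 07:50:23,40.73,31.82\n"
        "2018-08-28 08:05:23,47.13,33.65\n"
    ),
}

frames = []
for name, text in csv_texts.items():
    frame = pd.read_csv(io.StringIO(text), parse_dates=["capture_datetime_utc"])
    frame["file_name"] = name  # remember which file each row came from
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)

# Hourly mean per file: group by file name and by 60-minute time bins
hourly = (
    df.groupby(["file_name", pd.Grouper(key="capture_datetime_utc", freq="60min")])
    .mean()
    .reset_index()
)

print(hourly)
```

Because the file name is one of the group keys, it survives the aggregation, and only the hours that actually contain readings for a given file appear in the result.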

Nicely done, @rahlf23; thank you. My upvote does not count yet, as I am still a n00b here, but it is apparently recorded nonetheless.
