I have a large dataframe (df) where the start looks like:

date,number
2015-12-28,161
2015-12-29,225
2015-12-30,197
2016-06-06,217
2016-06-07,301
2016-06-08,317
2016-06-09,338
2016-06-10,308
2016-10-24,108
2016-10-25,142
2016-10-26,162
2016-10-27,165
2016-10-28,141
2016-01-04,193
2016-01-05,249
2016-01-06,263
2016-01-07,266
2016-01-08,248
2017-01-23,121

This is built by cycling through a number of directories, opening a specific file in each one, and grouping the data in it. Each directory contributes part of the final dataframe. The code used to generate it is below:

def main():
    folder = 'path'
    frames = []
    df_final = pd.DataFrame()

    for dirname, dirs, files in os.walk(folder):
        for filename in files:
            filename_without_extension, extension = os.path.splitext(filename)
            if filename_without_extension == 'portfolio-trade-pos-info':
                df = pd.read_csv(dirname + '/' + filename, index_col='date')

                trades = df.groupby('date')[['trade']].count()
                frames.append(trades)

                df_final = df_final.append(df)
                df_final.index_col = 'date'
                df_final.sort_index()

    final = pd.concat(frames)
    final.sort_values('date')
    final.to_csv('trades-per-day.csv', index=True)

I am getting the error:

Traceback (most recent call last):
  File "./trades_per_day.py", line 54, in <module>
    main()
  File "./trades_per_day.py", line 33, in main
    trades = df.groupby('date')[['trade']].count()
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 3991, in groupby
    **kwargs)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 1511, in groupby
    return klass(obj, by, **kwds)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 370, in __init__
    mutated=self.mutated)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 2462, in _get_grouper
    in_axis, name, gpr = True, gpr, obj[gpr]
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib64/python2.7/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'date'

Is there a way to change the data type of the dataframe index in df_final to date so I can order the dataframe in date order?

So the above output would be ordered:

date    number
28/12/2015  161
29/12/2015  225
30/12/2015  197
04/01/2016  193
05/01/2016  249
06/01/2016  263
07/01/2016  266
08/01/2016  248
06/06/2016  217
07/06/2016  301
08/06/2016  317
09/06/2016  338
10/06/2016  308
24/10/2016  108
25/10/2016  142
26/10/2016  162
27/10/2016  165
28/10/2016  141
23/01/2017  121
  • pd.to_datetime() Commented Jul 23, 2018 at 21:12
  • Use df = pd.read_csv(dirname + '/' +filename, parse_dates=['date']) to parse the date column on reading in. Commented Jul 23, 2018 at 21:13

1 Answer

Use the parse_dates parameter of pd.read_csv.

MCVE:

from io import StringIO

import pandas as pd

csvfile = StringIO("""date,number
2015-12-28,161
2015-12-29,225
2015-12-30,197
2016-06-06,217
2016-06-07,301
2016-06-08,317
2016-06-09,338
2016-06-10,308
2016-10-24,108
2016-10-25,142
2016-10-26,162
2016-10-27,165
2016-10-28,141
2016-01-04,193
2016-01-05,249
2016-01-06,263
2016-01-07,266
2016-01-08,248
2017-01-23,121""")

df = pd.read_csv(csvfile, parse_dates=['date'])

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 2 columns):
date      19 non-null datetime64[ns]
number    19 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 384.0 bytes
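
If the dataframe has already been built with the dates stored as strings in its index, the pd.to_datetime() route mentioned in the comments could be sketched roughly as follows. The rows are a handful taken from the question; the way df_final is constructed here is only an assumption for illustration, not the asker's actual pipeline.

```python
import pandas as pd

# A few rows from the question, with the dates held as plain strings in the index.
df_final = pd.DataFrame(
    {"number": [161, 217, 193, 121]},
    index=pd.Index(
        ["2015-12-28", "2016-06-06", "2016-01-04", "2017-01-23"], name="date"
    ),
)

# Convert the string index to a DatetimeIndex.
df_final.index = pd.to_datetime(df_final.index)

# sort_index() returns a new dataframe by default, so the result must be assigned.
df_final = df_final.sort_index()

print(df_final)
```

The reassignment on the last step matters: calling df_final.sort_index() without keeping the return value leaves df_final unchanged.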

2 Comments

Thanks, I changed the read_csv line to: df = pd.read_csv(dirname + '/' +filename, parse_dates=['date']). However, the resulting final dataframe is still not in date order.
If you want the dates in the index, you can add index_col='date' to that read_csv call and then call .sort_index() afterwards. Alternatively, you can use .sort_values('date').
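
Putting the suggestions from this thread together, a minimal sketch of the per-file loop might look like the following. The two in-memory files stand in for the CSVs that os.walk would find, and their contents are made up for illustration; only the 'date' and 'trade' column names come from the question.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-ins for the per-directory CSV files; the values are invented.
file_a = StringIO("date,trade\n2016-06-06,a\n2016-06-06,b\n2016-06-07,c\n")
file_b = StringIO("date,trade\n2015-12-28,d\n2016-01-04,e\n2016-01-04,f\n")

frames = []
for handle in (file_a, file_b):
    # parse_dates converts 'date' to datetime64 while it is still a regular column,
    # so groupby('date') can look it up by name.
    df = pd.read_csv(handle, parse_dates=["date"])
    frames.append(df.groupby("date")[["trade"]].count())

# After groupby, 'date' is the index of each frame, so sort on the index
# and keep the returned dataframe.
final = pd.concat(frames).sort_index()

print(final)
```

One likely reason the result in the question stayed unsorted is that sort_index() and sort_values() return a new dataframe by default, and the original code discards that return value. If the dd/mm/yyyy layout from the question is wanted in the output file, to_csv accepts a date_format argument (for example date_format='%d/%m/%Y'), which should apply to datetime values when writing.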
