
I have intraday time series data at 30-second intervals in a CSV file with the following format:

20120105, 080000,   1
20120105, 080030,   2
20120105, 080100,   3
20120105, 080130,   4
20120105, 080200,   5

How can I read it into a pandas data frame with these two different indexing schemes:

1. Combine date and time into a single datetime index

2. Use date as the primary index and time as the secondary index in a multiindex dataframe

What are the pros and cons of these two schemes? Is one generally preferable to the other? In my case, I would like to do time-of-day analysis, but I am not sure which scheme will be more convenient for that purpose. Thanks in advance.

  • For time-of-day analysis you should be well covered using the at_time and between_time methods once you've created a proper DatetimeIndex. Commented Jan 13, 2013 at 0:14

1 Answer

  1. Combine date and time into a single datetime index

    # Python 3 needs io.StringIO for a str; nested-list parse_dates is deprecated since pandas 2.2.
    df = pd.read_csv(io.StringIO(text), parse_dates=[[0, 1]], header=None, index_col=0)
    print(df)
    #                      2
    # 0_1                   
    # 2012-01-05 08:00:00  1
    # 2012-01-05 08:00:30  2
    # 2012-01-05 08:01:00  3
    # 2012-01-05 08:01:30  4
    # 2012-01-05 08:02:00  5
    
  2. Use date as the primary index and time as the secondary index in a multiindex dataframe

    # The time level is read as an integer, so its leading zero is lost (080000 becomes 80000).
    df2 = pd.read_csv(io.StringIO(text), parse_dates=True, header=None, index_col=[0, 1])
    print(df2)
    #                   2
    # 0          1       
    # 2012-01-05 80000  1
    #            80030  2
    #            80100  3
    #            80130  4
    #            80200  5
    
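On recent pandas releases, combining columns through a nested parse_dates list is deprecated (since pandas 2.2), so the two schemes may need to be built by hand. The following is a minimal sketch rather than the method used above: it assumes Python 3 (hence io.StringIO), and the names date, time, value and timestamp are made up for illustration, since the file has no header. Reading the first two columns as strings keeps the leading zero in times such as 080000.

```python
import io
import pandas as pd

text = """\
20120105, 080000,   1
20120105, 080030,   2
20120105, 080100,   3
20120105, 080130,   4
20120105, 080200,   5"""

# Column names are hypothetical; the file itself has no header row.
raw = pd.read_csv(
    io.StringIO(text),
    header=None,
    names=["date", "time", "value"],
    dtype={"date": str, "time": str},  # keep leading zeros in the time column
    skipinitialspace=True,             # drop the blanks that follow each comma
)

# Scheme 1: a single DatetimeIndex built from the concatenated strings.
stamps = pd.to_datetime(raw["date"] + raw["time"], format="%Y%m%d%H%M%S")
df = raw.set_index(pd.DatetimeIndex(stamps, name="timestamp"))[["value"]]

# Scheme 2: a (date, time) MultiIndex; time stays a zero-padded string.
df2 = raw.assign(date=pd.to_datetime(raw["date"], format="%Y%m%d"))
df2 = df2.set_index(["date", "time"])

print(df)
print(df2)
```

With a real file, io.StringIO(text) would be replaced by the file name.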

My naive inclination would be to prefer a single index over the multiindex.

  • As the Zen of Python asserts, "Flat is better than nested".
  • The datetime is one conceptual object. Treat it as such. (It is better to have one datetime object than multiple columns for the year, month, day, hour, minute, etc. Similarly, it is better to have one index rather than two.)

However, I am not very experienced with Pandas, and there could be some advantage to having the multiindex when doing time-of-day analysis.

I would try coding up some typical calculations both ways, and then see which one I liked better on the basis of ease of coding, readability, and performance.

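One way to run that comparison for time-of-day analysis is sketched below. The two-day sample frame and the column name value are invented for illustration (the question shows only one day). at_time and between_time are the DatetimeIndex-based methods mentioned in the comment on the question; xs and groupby(level=...) are used here as their MultiIndex counterparts.

```python
import datetime
import pandas as pd

# Hypothetical sample: two days of 30-second data, five rows per day.
step = pd.Timedelta(seconds=30)
idx = pd.date_range("2012-01-05 08:00:00", periods=5, freq=step).append(
    pd.date_range("2012-01-06 08:00:00", periods=5, freq=step)
)
df = pd.DataFrame({"value": range(1, 11)}, index=idx)

# Flat DatetimeIndex: select by time of day across all dates.
at_0801 = df.at_time("08:01:00")
window = df.between_time("08:00:30", "08:01:30")
mean_by_time = df.groupby(df.index.time)["value"].mean()

# MultiIndex: the same data keyed by (date, time).
mi = df.copy()
mi.index = pd.MultiIndex.from_arrays(
    [df.index.date, df.index.time], names=["date", "time"]
)
at_0801_mi = mi.xs(datetime.time(8, 1), level="time")
mean_by_time_mi = mi.groupby(level="time")["value"].mean()

print(at_0801)
print(at_0801_mi)
print(mean_by_time)
```

In this sketch both layouts answer the same questions. The flat index keeps the time-based selectors available directly, while the MultiIndex makes the time level explicit at the cost of building it first.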

This was my setup to produce the results above.

    import io
    import pandas as pd

    text = '''\
    20120105, 080000,   1
    20120105, 080030,   2
    20120105, 080100,   3
    20120105, 080130,   4
    20120105, 080200,   5'''

You can of course use

    pd.read_csv(filename, ...)

instead of

    pd.read_csv(io.StringIO(text), ...)