Reading in specific date lines from a file with pandas python

Question

I am attempting to read in many files. Each file is a daily data file with a record every 10 minutes. The data in each file is "chunked up" like this:

2015-11-08 00:10:00 00:10:00
#    z  speed    dir      W   sigW       bck   error 
30   3.32  111.9   0.15   0.12  1.50E+05       0
40   3.85  108.2   0.07   0.14  7.75E+04       0
50   4.20  107.9   0.06   0.15  4.73E+04       0
60   4.16  108.5   0.03   0.19  2.73E+04       0
70   4.06   93.6   0.03   0.23  9.07E+04       0
80   4.06   93.8   0.07   0.28  1.36E+05       0

2015-11-08 00:20:00 00:10:00
#    z  speed    dir      W   sigW       bck   error 
30   3.79  120.9   0.15   0.11  7.79E+05       0
40   4.36  115.6   0.04   0.13  2.42E+05       0
50   4.71  113.6   0.07   0.14  6.84E+04       0
60   5.00  113.3   0.13   0.17  1.16E+04       0
70   4.29   94.2   0.22   0.20  1.38E+05       0
80   4.54   94.1   0.11   0.25  1.76E+05       0

2015-11-08 00:30:00 00:10:00
#    z  speed    dir      W   sigW       bck   error 
30   3.86  113.6   0.13   0.10  2.68E+05       0
40   4.34  116.1   0.09   0.11  1.41E+05       0
50   5.02  112.8   0.04   0.12  7.28E+04       0
60   5.36  110.5   0.01   0.14  5.81E+04       0
70   4.67   95.4   0.14   0.16  7.69E+04       0
80   4.56   95.0   0.15   0.21  9.84E+04       0

...

The file continues like this every 10 minutes for the whole day. This file is named 151108.mnd. I want my code to read all the files for November (1511??.mnd). For each daily file, it should grab every datetime line (for the partial example above: 2015-11-08 00:10:00, 2015-11-08 00:20:00, 2015-11-08 00:30:00, etc.) and store them as date variables, then move on to the next daily file (151109.mnd), grab its datetime lines, and append them to the dates already stored, and so on for the whole month. Here is the code I have so far:

import pandas as pd
import glob
import datetime

filename = glob.glob('1511??.mnd')
data_nov15_hereford = pd.DataFrame()
frames = []
dates = []
counter = 1
for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows = 32)
    for line in f_nov15_hereford:
        if line.startswith("20"):
            print line
            date_object = datetime.datetime.strptime(line[:-6], '%Y-%m-%d %H:%M:%S %f')
            dates.append(date_object)
            counter = 0
        else:
            counter += 1 
    frames.append(f_nov15_hereford) 
data_nov15_hereford = pd.concat(frames,ignore_index=True)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)


print dates

This code has some problems: when I print dates, it prints two copies of every date, and it only prints the first date of each file (2015-11-08 00:10:00, 2015-11-09 00:10:00, etc.). It isn't going line by line through each file and moving on to the next file only once all dates in that file are stored, as I want. Instead, it just grabs the first date in each file. Any help with this code? Is there an easier way to do what I want? Thanks!

RootTwo · Accepted Answer · 2016-03-02 08:19:04Z


A few observations:

First: why you are only getting the first date in each file:

f_nov15_hereford = pd.read_csv(i, skiprows = 32)
for line in f_nov15_hereford:
    if line.startswith("20"):

The first line reads the file into a pandas DataFrame. The second line iterates over the column labels of the DataFrame, not its rows. As a result, the last line checks whether each column label starts with "20". This only happens once per file, because read_csv treats the first line after the skipped rows (the first date line) as the header.
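
A minimal sketch of that behaviour, using a small made-up frame (the second column name and the values are illustrative and not taken from the data files):

```python
import pandas as pd

# Hypothetical two-column frame, for illustration only
df = pd.DataFrame({"2015-11-08 00:10:00 00:10:00": [1, 2], "other": [3, 4]})

# Iterating over a DataFrame yields its column labels
labels = [label for label in df]

# Rows have to be requested explicitly, e.g. with itertuples()
rows = [tuple(row) for row in df.itertuples(index=False)]

print(labels)
print(rows)
```

Only the first label starts with "20", which matches the symptom of getting one date per file.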

Second: counter is initialized and its value is changed, but it is never used. I presume it was intended for skipping over lines in the files.

Third: it might be simpler to collect all the dates into a Python list and then convert that to a pandas DataFrame if needed.

import pandas as pd
import glob
import datetime as dt

# number of lines to skip before the first date
offset = 32

# number of lines from one date to the next
recordlength = 9

pattern = '1511??.mnd'

dates = []

for filename in glob.iglob(pattern):

    with open(filename) as datafile:

        count = -offset
        for line in datafile:
            if count == 0:
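                # the timestamp is the first 19 characters of the date line;
                # slicing from the front does not depend on the line ending
                fmt = '%Y-%m-%d %H:%M:%S'
                date_object = dt.datetime.strptime(line[:19], fmt)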
                dates.append(date_object)

            count += 1 

            if count == recordlength:
                count = 0

data_nov15_hereford = pd.DataFrame(dates, columns=['Dates'])

print(dates)
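
If the number of lines per record is not fixed, an alternative is to match the timestamp pattern itself rather than count lines. The following is a minimal sketch under that assumption; the helper name `read_dates` and the regular expression are illustrative and not part of the code above.

```python
import datetime as dt
import glob
import re

# A line counts as a date line if it begins with 'YYYY-MM-DD HH:MM:SS'
# (assumed from the sample shown in the question)
STAMP = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def read_dates(lines):
    """Return a datetime for every line that starts with a timestamp."""
    found = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            found.append(dt.datetime.strptime(match.group(0), '%Y-%m-%d %H:%M:%S'))
    return found

dates = []
# sorted() keeps the daily files in chronological order
for filename in sorted(glob.glob('1511??.mnd')):
    with open(filename) as datafile:
        dates.extend(read_dates(datafile))
```

This trades the fixed `offset` and `recordlength` values for a pattern check on every line, so it keeps working if a block has more or fewer height levels.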


2 Comments

HM14 Over a year ago

This seems to work great! My only complaint is that when I print the dates, it still gives me 2 sets. Or if I print np.shape(dates), I get two shapes: (2046L,) (2046L,)

HM14 Over a year ago

Never mind, I think that this is a problem with my notebook and not the code! Thanks so much!

Parfait · Accepted Answer · 2016-03-01 22:38:36Z

Consider modifying the CSV data line by line before reading it in as a DataFrame. The script below opens each original file in the glob list and writes it to a temp file, moving the dates into a last column and removing the repeated headers and empty lines.

CSV Data (assuming the text view of the CSV file looks like the following; if the actual file differs, adjust the Python code)

2015-11-0800:10:0000:10:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.32,111.9,0.15,0.12,1.50E+05,0
40,3.85,108.2,0.07,0.14,7.75E+04,0
50,4.2,107.9,0.06,0.15,4.73E+04,0
60,4.16,108.5,0.03,0.19,2.73E+04,0
70,4.06,93.6,0.03,0.23,9.07E+04,0
80,4.06,93.8,0.07,0.28,1.36E+05,0
,,,,,,
2015-11-0800:10:0000:20:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.79,120.9,0.15,0.11,7.79E+05,0
40,4.36,115.6,0.04,0.13,2.42E+05,0
50,4.71,113.6,0.07,0.14,6.84E+04,0
60,5,113.3,0.13,0.17,1.16E+04,0
70,4.29,94.2,0.22,0.2,1.38E+05,0
80,4.54,94.1,0.11,0.25,1.76E+05,0
,,,,,,
2015-11-0800:10:0000:30:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.86,113.6,0.13,0.1,2.68E+05,0
40,4.34,116.1,0.09,0.11,1.41E+05,0
50,5.02,112.8,0.04,0.12,7.28E+04,0
60,5.36,110.5,0.01,0.14,5.81E+04,0
70,4.67,95.4,0.14,0.16,7.69E+04,0
80,4.56,95,0.15,0.21,9.84E+04,0

Python Script

import glob, os
import pandas as pd

filenames = glob.glob('1511??.mnd')
temp = 'temp.csv'

# INITIATE EMPTY DATAFRAME
data_nov15_hereford = pd.DataFrame(columns=['z', 'speed', 'dir', 'W', 
                                            'sigW', 'bck', 'error', 'date'])

# ITERATE THROUGH EACH FILE IN GLOB LIST
for file in filenames:        
    # DELETE PRIOR TEMP VERSION                    
    if os.path.exists(temp): os.remove(temp)

    header = 0
    # READ IN ORIGINAL CSV
    with open(file, 'r') as txt1:
        for rline in txt1:
            # SAVE DATE VALUE THEN SKIP ROW
            if "2015-11" in rline: date = rline.replace(',',''); continue

            # SKIP BLANK LINES (CHANGE IF NO COMMAS)               
            if rline == ',,,,,,\n': continue

            # ADD NEW 'DATE' COLUMN AND SKIP OTHER HEADER LINES
            if 'z,speed,dir,W,sigW,bck,error' in rline:
                if header == 1: continue
                rline = rline.replace('\n', ',date\n')
                with open(temp, 'a') as txt2:
                    txt2.write(rline)
                continue
            header = 1

            # APPEND LINE TO TEMP CSV WITH DATE VALUE
            with open(temp, 'a') as txt2:
                txt2.write(rline.replace('\n', ','+date))

    # APPEND TEMP FILE TO DATA FRAME
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data_nov15_hereford = pd.concat([data_nov15_hereford, pd.read_csv(temp)])

Output

     z  speed    dir     W  sigW     bck  error                        date
0   30   3.32  111.9  0.15  0.12  150000      0  2015-11-0800:10:0000:10:00
1   40   3.85  108.2  0.07  0.14   77500      0  2015-11-0800:10:0000:10:00
2   50   4.20  107.9  0.06  0.15   47300      0  2015-11-0800:10:0000:10:00
3   60   4.16  108.5  0.03  0.19   27300      0  2015-11-0800:10:0000:10:00
4   70   4.06   93.6  0.03  0.23   90700      0  2015-11-0800:10:0000:10:00
5   80   4.06   93.8  0.07  0.28  136000      0  2015-11-0800:10:0000:10:00
6   30   3.79  120.9  0.15  0.11  779000      0  2015-11-0800:10:0000:20:00
7   40   4.36  115.6  0.04  0.13  242000      0  2015-11-0800:10:0000:20:00
8   50   4.71  113.6  0.07  0.14   68400      0  2015-11-0800:10:0000:20:00
9   60   5.00  113.3  0.13  0.17   11600      0  2015-11-0800:10:0000:20:00
10  70   4.29   94.2  0.22  0.20  138000      0  2015-11-0800:10:0000:20:00
11  80   4.54   94.1  0.11  0.25  176000      0  2015-11-0800:10:0000:20:00
12  30   3.86  113.6  0.13  0.10  268000      0  2015-11-0800:10:0000:30:00
13  40   4.34  116.1  0.09  0.11  141000      0  2015-11-0800:10:0000:30:00
14  50   5.02  112.8  0.04  0.12   72800      0  2015-11-0800:10:0000:30:00
15  60   5.36  110.5  0.01  0.14   58100      0  2015-11-0800:10:0000:30:00
16  70   4.67   95.4  0.14  0.16   76900      0  2015-11-0800:10:0000:30:00
17  80   4.56   95.0  0.15  0.21   98400      0  2015-11-0800:10:0000:30:00
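
The same reshaping can also be done in memory, without a temporary file. The sketch below assumes the whitespace-delimited layout from the question (a timestamp line, a `#` header line, data rows, a blank line) rather than the comma-separated view above; the function name `parse_records` and the column list are assumptions made for illustration.

```python
import re

# Column names assumed from the '#' header line in the question
COLUMNS = ['z', 'speed', 'dir', 'W', 'sigW', 'bck', 'error']
STAMP = re.compile(r'^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def parse_records(lines):
    """Yield one dict per data row, tagged with the timestamp of its block."""
    date = None
    for line in lines:
        line = line.strip()
        match = STAMP.match(line)
        if match:
            date = match.group(0)      # remember the current block's timestamp
            continue
        if not line or line.startswith('#') or date is None:
            continue                   # skip blanks, headers and the preamble
        values = line.split()
        if len(values) != len(COLUMNS):
            continue                   # not a data row in the expected layout
        record = dict(zip(COLUMNS, (float(v) for v in values)))
        record['date'] = date
        yield record
```

For example, `pd.DataFrame(list(parse_records(datafile)))` should build a frame for one open file, and the per-file frames can then be combined with `pd.concat`.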
