Read daily files and concatenate them

Question

Edit - here is my modified code: http://jsfiddle.net/#&togetherjs=GzytydCsRh

Can someone take a look and give me some feedback? It still seems a bit long, but this is the first time I have used functions.

I am reading a bunch of CSV files and using glob to concatenate them all together into separate dataframes. I eventually join them together and basically create a single large file which I use to connect to a dashboard. I am not too familiar with Python, but I use pandas and sklearn often.

As you can see, I am basically just reading the last 60 (or more) days worth of data (last 60 files) and creating a dataframe for each. This works, but I am wondering if there is a more Pythonic/better way? I watched a video on pydata (about not being restricted by PEP 8 and making sure your code is Pythonic) which was interesting.

(FYI - the reason I need to read 60 days' worth of files is that customers can fill out a survey about a call which happened a long time ago. The customer fills out a survey today about a call that happened in July. I need to know about that call: how long it lasted, what the topic was, etc.)

import pandas as pd
import numpy as np
from pandas import *
import datetime as dt
import os
from glob import glob
os.chdir(r'C:\\Users\Documents\FTP\\')
loc = r'C:\\Users\Documents\\'
rosterloc = r'\\mand\\'
splitsname = r'Splits.csv'
fcrname = r'global_disp_'
npsname = r'survey_'
ahtname = r'callbycall_'
rostername = 'Daily_Roster.csv'
vasname = r'vas_report_'
ext ='.csv'
startdate = dt.date.today() - Timedelta('60 day')
enddate = dt.date.today() 
daterange = Timestamp(enddate) - Timestamp(startdate)
daterange = (daterange / np.timedelta64(1, 'D')).astype(int)

data = []
frames = []
calls = []
bracket = []
try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        aht = pd.read_csv(ahtname+date_range.strftime('%Y_%m_%d')+ext)
        calls.append(aht)
except IOError:
        print('File does not exist:', ahtname+date_range.strftime('%Y_%m_%d')+ext)
aht = pd.concat(calls)
print('AHT Done')                 
try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        fcr = pd.read_csv(fcrname+date_range.strftime('%m_%d_%Y')+ext, parse_dates = ['call_time'])
        data.append(fcr)
except IOError:
        print('File does not exist:', fcrname+date_range.strftime('%m_%d_%Y')+ext)
fcr = pd.concat(data)
print('FCR Done')                                                
try:
    for date_range in (Timestamp(enddate) - dt.timedelta(n) for n in range(3)):
        nps = pd.read_csv(npsname+date_range.strftime('%m_%d_%Y')+ext, parse_dates = ['call_date','date_completed'])
        frames.append(nps)
except IOError:
        print('File does not exist:', npsname+date_range.strftime('%m_%d_%Y')+ext)
nps = pd.concat(frames)
print('NPS Done')                
try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        vas = pd.read_csv(vasname+date_range.strftime('%m_%d_%Y')+ext, parse_dates = ['Call_date'])
        bracket.append(vas)
except IOError:
        print('File does not exist:', vasname+date_range.strftime('%m_%d_%Y')+ext)
vas = pd.concat(bracket)
print('VAS Done')                 
roster = pd.read_csv(loc+rostername)
print('Roster Done')
splits = pd.read_csv(loc+splitsname)
print('Splits Done')

Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers.
– Simon Forsberg, Commented Sep 9, 2015 at 20:02

mkrieger1 · Accepted Answer · 2015-09-08 21:36:02Z

Use a class, or at least some functions, to make your code more readable and understandable

Very first reaction looking at your code is ....blech. I don't want to read that giant blob.

Why not make a class to bundle a bunch of functions together, such as a function readAndConcatAHT? Actually, many of these for loops are doing the exact same thing for slightly differently named files. Why not do something like a function that takes in a filename and then runs a for loop like so:

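def readAndConcatFile(filename, dateformat, daterange, **kwargs):
    # Read one file per day and return them concatenated into a single frame.
    # startdate and ext are the module-level names from the question's script.
    frames = []
    for date_range in (pd.Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        path = filename + date_range.strftime(dateformat) + ext
        try:
            # kwargs carries per-file options such as parse_dates=['call_time']
            frames.append(pd.read_csv(path, **kwargs))
        except IOError:
            # Skip a missing day instead of abandoning the rest of the range.
            print('File does not exist:', path)
    return pd.concat(frames)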

This would really clean up your code even if you elected not to write any other functions or a class. I think it's also fairer to your reader and yourself to respect DRY (don't repeat yourself) and not make readers check for themselves that you are doing something absolutely repetitive with slightly different file names.

I'll put an extra point to say I think a class would be nice because in your init or some processing function you could string together a bunch of calls to readAndConcatFile to standardize your read/write process for these CSV files. This will, again, make your code more extensible and more readable.
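A rough sketch of what such a class could look like follows. The class name DailyReports, its method read, and the parameter names are all made up for illustration; only the file prefixes and date formats come from the question. It assumes every file type lives in one folder and that a missing day should be skipped rather than stop the run.

```python
import datetime as dt
from pathlib import Path

import pandas as pd


class DailyReports:
    """Reads one CSV per day for several report types from a single folder.

    Hypothetical helper: the name and interface are illustrative only.
    """

    def __init__(self, folder, end_date, days_back):
        self.folder = Path(folder)
        # Oldest day first, ending the day before end_date (as in the question).
        self.dates = [end_date - dt.timedelta(days=n) for n in range(days_back, 0, -1)]

    def read(self, prefix, date_format, **read_csv_kwargs):
        """Concatenate every available daily file for one report type."""
        frames = []
        for day in self.dates:
            path = self.folder / f"{prefix}{day.strftime(date_format)}.csv"
            try:
                frames.append(pd.read_csv(path, **read_csv_kwargs))
            except FileNotFoundError:
                # A missing day is reported and skipped, not fatal.
                print("File does not exist:", path)
        # pd.concat raises on an empty list, so fall back to an empty frame.
        return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

With something like this, each report type becomes a single call, for example reports.read('global_disp_', '%m_%d_%Y', parse_dates=['call_time']), and the loop and error handling live in one place.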

Avoid redundant import statements and stick with standards

Almost everyone uses import pandas as pd. I wouldn't recommend doing it any other way, and it's never a good idea to do a wholesale import *.
Don't import glob unless you are actually using it. Where do you use glob after importing it?

Use special features only when you need them

Do you actually need raw strings? I don't see you using your strings in any way that would seem to require them.
Similarly, why use os.chdir when it could be smarter to specify filenames as absolute file names? Here you're again using an option you don't really need and that could have future unintended side effects.
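One way to drop os.chdir is to build every file name from a single base folder. The sketch below assumes a hypothetical helper called daily_path; the folder and the survey_ prefix are taken from the question, everything else is illustrative.

```python
import datetime as dt
from pathlib import Path

# Base folder taken from the question. Note that a raw string
# cannot end in a single backslash, so there is no trailing one.
DATA_DIR = Path(r'C:\Users\Documents\FTP')


def daily_path(folder, prefix, day, date_format, ext='.csv'):
    """Return the full path of one daily file, e.g. <folder>/survey_09_07_2015.csv."""
    return Path(folder) / (prefix + day.strftime(date_format) + ext)


# No chdir needed: the folder travels with the file name.
print(daily_path(DATA_DIR, 'survey_', dt.date(2015, 9, 7), '%m_%d_%Y'))
```

Because the working directory is never changed, other code in the same process that relies on relative paths keeps working as before.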

Use more defined constants

It's not a good idea to hard code Timedelta('60 day') like so. You should separately specify DAY_RANGE = 60 as a constant and then use that wherever you'd use 60. That way you can easily change the day range. Alternatively, you could make the day range an input parameter to your script so that non-programmer users can also call this script for their desired look-back period.
Similarly, you can save your desired date formats as strings to be treated as constants at the top of your file: format1 = '%m_%d_%Y' and format2 = '%Y_%m_%d'. Again, this makes it easier to see what's going on and also makes it easier to make changes in the future. You can change just one string at the top of your file to change all related formatting, rather than having to change each string. This won't make any given line of code shorter, but it will make it easier to maintain.
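As a sketch of both suggestions together, the constants can sit at the top of the file and the day range can be exposed as a command-line option. The constant names, the --days flag, and the helper functions below are assumptions chosen for illustration, not anything from the original script.

```python
import argparse
import datetime as dt

DAY_RANGE = 60                    # default look-back period in days
MONTH_FIRST_FORMAT = '%m_%d_%Y'   # global_disp_, survey_ and vas_report_ files
YEAR_FIRST_FORMAT = '%Y_%m_%d'    # callbycall_ files


def parse_args(argv=None):
    """Let the caller override the look-back period, e.g. --days 30."""
    parser = argparse.ArgumentParser(description='Concatenate daily CSV files.')
    parser.add_argument('--days', type=int, default=DAY_RANGE,
                        help='number of days to look back')
    return parser.parse_args(argv)


def look_back_dates(days, end_date):
    """Return the `days` dates before end_date, oldest first."""
    return [end_date - dt.timedelta(days=n) for n in range(days, 0, -1)]


# Example: equivalent to running the script with "--days 30".
args = parse_args(['--days', '30'])
dates = look_back_dates(args.days, dt.date(2015, 9, 8))
print(len(dates), dates[0].strftime(MONTH_FIRST_FORMAT))
```

Passing argv explicitly keeps the function easy to call from tests; in the real script, parse_args() with no argument would read the actual command line.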

More sophisticated error handling

Error handling is not something I do enough of myself, but I wonder if you can do better here. I'm going to assume that errors in ahtname are related, for example, to errors in fcrname. If that's the case, once you establish that a date range is missing for one kind of file, why not delete that daterange for all further queries in future loops? You could do so easily by simply deleting that member of daterange that causes the IOError. Then you wouldn't get repetitive error messages that are really all telling you the same thing.
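The idea could be sketched roughly as follows, under the assumption stated above that a date missing for one file type is missing for the others too. The function names available_dates and report_missing_once are hypothetical, and the check uses Path.exists() rather than catching IOError, which is a swapped-in technique and not what the original code does.

```python
import datetime as dt
from pathlib import Path


def available_dates(folder, prefix, dates, date_format, ext='.csv'):
    """Split dates into those that have a file on disk and those that do not."""
    found, missing = [], []
    for day in dates:
        path = Path(folder) / (prefix + day.strftime(date_format) + ext)
        (found if path.exists() else missing).append(day)
    return found, missing


def report_missing_once(folder, file_types, dates):
    """Check each file type in turn, dropping dates already known to be missing.

    file_types is a list of (prefix, date_format) pairs. Returns the dates
    that have every file type, and the dates that were dropped.
    """
    remaining, dropped = list(dates), []
    for prefix, date_format in file_types:
        remaining, missing = available_dates(folder, prefix, remaining, date_format)
        for day in missing:
            print('No', prefix, 'file for', day, '- skipping this date from here on')
        dropped.extend(missing)
    return remaining, dropped
```

Each missing date is reported once, for the first file type that lacks it, and later file types are only checked for the dates that remain.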

What I liked

It's good practice to use generator expressions where you can, so I liked seeing code like for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):.

This response seems awesome. I can already see how to implement some things. I am completely unfamiliar with classes and such, though. I use pandas for data but never really learned much basic Python. Yes, the IOError would be the same for all. It typically means that today's file is not available because of a timezone difference.
– trench, Commented Sep 8, 2015 at 19:14
@LanceDacey glad it might be helpful. You can read about Python classes here: stackoverflow.com/questions/10004850/… The good news is that they're pretty straightforward, so even if you haven't worked with them before, it's pretty easy to write yourself a helpful class for organizational purposes. I'd say just think about member variables and member functions for now. Ignore other things people talk about until you've written a few classes. The answer in the linked post should be plenty to get you going on what you could do here.
– sunny, Commented Sep 8, 2015 at 19:21
I'll give you the best answer, but I would like to modify my existing code first and paste the results just so I don't lose track of you/your feedback.
– trench, Commented Sep 9, 2015 at 16:19
@LanceDacey you are free to answer your own question, that's probably the best way. Or if you want a further round of reviews you can post new/updated code with a link back to this to show you are building on prior work. I look forward to seeing your answer.
– sunny, Commented Sep 9, 2015 at 16:29
@LanceDacey I didn't notice that before. Couldn't you make that a function parameter?
– sunny, Commented Sep 9, 2015 at 17:27

SuperBiasedMan · Accepted Answer · 2015-09-07 17:19:40Z

Why are you importing like this:

from pandas import *
import pandas as pd

It's a bad idea. import * is generally not a good idea, but in particular you now have a confusing import setup that gives you two ways to access the module. If you want to be able to call functions without the pandas. or pd. prefix, then just use from pandas import function_name. You should also lay out your imports more neatly than this. The style guide, PEP 8, contains information on this and a ton of other readability advice.

import pandas as pd
import numpy as np
import datetime as dt

import os

from glob import glob

You're using raw strings but not taking advantage of them? By prepending a string with r you don't need to escape a backslash. So you can just write the normal path.

# A raw string cannot end in a single backslash, so the trailing one is dropped.
os.chdir(r'C:\Users\Documents\FTP')

Also there's a lot of stuff defined here that is not at all clear. I'd suggest using clearer names except I don't even know what names would be clearer. Adding some comments might help.

Indenting the body of your except to match the inside of the for loop might not raise a SyntaxError, but it is confusing. Move it back out so that it is just one indent in from the except. You should also put a space on either side of your operators, like your +s. Having no space makes the expression look like one long string, which is much harder to read.

try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        aht = pd.read_csv(ahtname + date_range.strftime('%Y_%m_%d') + ext)
        calls.append(aht)
except IOError:
    print('File does not exist:', ahtname + date_range.strftime('%Y_%m_%d') + ext)
aht = pd.concat(calls)
print('AHT Done')

Thanks - made those changes. Do you have any recommendations based on actually importing and reading the list of files? The files are typically named something_something_%Y_%M_%D.csv. That's why I have a date range.
– trench, Commented Sep 7, 2015 at 17:36
Are there other files in the folder you wouldn't want to read? A regular expression could be used to find all the files in a folder that match a pattern like a date, but I don't offhand know a good way to filter files you processed on a previous run of the script.
– SuperBiasedMan, Commented Sep 7, 2015 at 17:39
Also, what in particular would you like to improve about how the files are imported/read? I might be misunderstanding you.
– SuperBiasedMan, Commented Sep 7, 2015 at 17:39
Yeah, there are probably 15 different file types downloaded from an FTP server. I'm just interested in a few of them. The style above works; I was just wondering if there is anything I should do, like define a function or class or something Pythonic. I'll link the video I saw, where my code looks a lot like the example the instructor said 'don't do this' about.
– trench, Commented Sep 7, 2015 at 17:46
