
I have a CSV file with variable column widths, where each row is a date, a region, and a set of IDs that were observed at that date/region.

The data look like this:

12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56

I am trying to:

(1) Split all of the IDs and just keep the brand, so “TESLA_12345” would become just “TESLA”. This part is no problem.

(2) Aggregate by month-region, so that we know the count of failures for each brand (e.g. Tesla or BMW) for each month-region tuple.

The output should look something like this (format is not super important – it just has to be legible):

month region BMW TESLA
10 Region1   1     2
12 Region1   1     2
10 Region2   3     5
11 Region2   3     7

It seems like a natural way to do this would be to read the CSV line by line and update counts for each brand in nested dictionaries (i.e. have a regions dict containing the two regions, each of which contains month dicts that keep a count such as {BMW: 2, TESLA: 1}). However, I’m struggling with updating the nested dicts, and I wonder if there is a much simpler solution (or if Pandas can do this effortlessly, etc.)

(note: obviously, the month can be extracted from the date with:

datetime.strptime(mydate, "%m/%d/%Y").strftime("%m")

)
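
For reference, the nested-dict bookkeeping I have in mind can be sketched with `collections.defaultdict` and `collections.Counter`, which sidestep the manual "does this key exist yet?" updates (a sketch, with the sample rows inlined as a string; in practice you would iterate over the open file):

```python
import io
from collections import Counter, defaultdict
from datetime import datetime

# sample rows inlined for illustration; in practice iterate over open("test.csv")
SAMPLE = """\
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56
"""

# {(month, region): Counter({brand: count})} -- defaultdict creates the inner
# Counter on first access, so there is no explicit nested-dict updating
counts = defaultdict(Counter)
for line in io.StringIO(SAMPLE):
    fields = line.strip().split(",")
    month = datetime.strptime(fields[0], "%m/%d/%Y").strftime("%m")
    counts[(month, fields[1])].update(x.split("_")[0] for x in fields[2:])

for (month, region), brands in sorted(counts.items()):
    print(month, region, brands["BMW"], brands["TESLA"])
```

This prints the same counts as the expected output above (e.g. `11 Region2 3 7`).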


1 Answer


Here's one approach. I wouldn't call it pretty, but it gets the job done.

  1. The first problem is the variable number of fields in each row. You can read the file one line at a time, store each row in a list, data, and trim the car IDs down to just the brand as you go:

    import pandas as pd
    
    # assuming the CSV is named test.csv
    data = []
    with open("test.csv") as f:
        for line in f:
            fields = line.strip().split(",")
            # "TESLA_332" -> "TESLA"; dates and regions have no "_" and pass through
            just_brand = [x.split("_")[0] for x in fields]
            data.append(just_brand)
    
  2. Now that the file is read into a Python data structure, we can reorder the rows so that the entry with the greatest number of fields comes first. Pandas handles missing trailing fields gracefully, padding shorter rows with None, so starting from the widest row means every subsequent row fits.

    df = pd.DataFrame(sorted(data, key=lambda row: len(row), reverse=True))
    
    df
                0        1      2      3      4      5      6      7      8      9
    0  10/14/2017  Region2    BMW    BMW    BMW  TESLA  TESLA  TESLA  TESLA  TESLA
    1  11/07/2017  Region2  TESLA    BMW  TESLA  TESLA   None   None   None   None
    2  11/02/2017  Region2  TESLA    BMW  TESLA  TESLA   None   None   None   None
    3  12/01/2017  Region1    BMW  TESLA  TESLA   None   None   None   None   None
    4  10/23/2017  Region1    BMW  TESLA  TESLA   None   None   None   None   None
    5  11/19/2017  Region2    BMW  TESLA   None   None   None   None   None   None
    
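  (Aside: if you'd rather skip the sorting step, pd.read_csv can ingest the ragged file directly when you pass explicit column names wide enough for the longest row. A sketch, with two sample rows inlined via StringIO and the column count of 10 assumed known in advance:)

```python
import io
import pandas as pd

# sample rows inlined; in practice pass the filename instead of a StringIO
SAMPLE = """\
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/19/2017,Region2,BMW_123,TESLA_31
"""

# With header=None alone, read_csv fails if a later row is wider than the
# first; explicit names sized for the widest row sidestep the sorting trick.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=range(10))

# trim "TESLA_332" -> "TESLA"; non-strings (the NaN padding) pass through
df = df.apply(lambda col: col.map(lambda v: v.split("_")[0] if isinstance(v, str) else v))
```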
  3. From here, it's just a matter of getting things named, organized, and formatted.

    df = (df.set_index([0,1])
            .stack()
            .reset_index(level=1)
            .rename(columns={1:"region",0:"make"})
            .reset_index(level=1, drop=True))
    
    df = (df.groupby([pd.to_datetime(df.index).month,"region","make"])
            .make.count()
            .unstack()
            .reset_index()
            .rename(columns={0:"month"}))
    df.columns.name = ""
    
    df
        region  BMW  TESLA  month
    0  Region1    1      2     10
    1  Region2    3      5     10
    2  Region2    3      7     11
    3  Region1    1      2     12
    
  4. (alternate, less gymnastic)

    # get TESLA, BMW counts for each row
    cts = df.iloc[:,2:].apply(lambda x: x.value_counts(), axis=1)
    # merge with date, region
    df2 = pd.concat([df.iloc[:, :2], cts], axis=1)
    # group by month and region, then sum the per-row counts
    (df2.groupby([pd.to_datetime(df2[0]).dt.month, 1])
        .sum(numeric_only=True)  # drops the raw date strings from the sum
        .reset_index()
        .rename(columns={0: "month", 1: "region"}))
    
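  (For completeness — this variant is not part of the original answer — the whole pipeline can also be collapsed into a long-format table plus pd.crosstab, which does the month/region/make counting in one call. A sketch with the sample rows inlined as a string:)

```python
import io
import pandas as pd

SAMPLE = """\
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56
"""

# build one (month, region, make) row per observed ID
rows = [line.split(",") for line in SAMPLE.strip().splitlines()]
long_df = pd.DataFrame(
    [(pd.to_datetime(r[0]).month, r[1], x.split("_")[0]) for r in rows for x in r[2:]],
    columns=["month", "region", "make"],
)

# cross-tabulate: one row per (month, region), one column per make
result = pd.crosstab([long_df["month"], long_df["region"]], long_df["make"]).reset_index()
```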

2 Comments

Wouldn't it be better if you made two int columns, "TESLA" and "BMW", and store how many times each one occurred in each line? This makes further processing much easier.
Qusai Alothman, I added a solution based on your suggestion. Thanks!
