
I have a CSV file with variable column widths, where each row is a date, a region, and a set of IDs that were observed at that date/region.

The data look like this:

12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56

I am trying to:

(1) Split all of the IDs and just keep the brand, so “TESLA_12345” would become just “TESLA”. This part is no problem.

(2) Aggregate by month-region, so that we know the count of failures for each brand (e.g. Tesla or BMW) for each month-region tuple.

The output should look something like this (format is not super important – it just has to be legible):

month region BMW TESLA
10 Region1   1     2
12 Region1   1     2
10 Region2   3     5
11 Region2   3     7

It seems like a natural way to do this would be to read the CSV line by line and update counts for each brand in nested dictionaries (i.e. have a regions dict containing the two regions, each of which contains month dicts that keep a count such as {BMW: 2, TESLA: 1}). However, I’m struggling with updating the nested dicts, and I wonder if there is a much simpler solution (or if Pandas can do this effortlessly, etc.)

(note: obviously, the month can be extracted from the date with:

datetime.strptime(mydate, "%m/%d/%Y").strftime("%m")

)
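
For reference, the nested-dict bookkeeping I have in mind can be sketched with `collections.defaultdict` and `collections.Counter`, which sidestep the manual "does this key exist yet?" updates (a sketch, with the sample rows inlined as a string; in practice you would iterate over the open file):

```python
import io
from collections import Counter, defaultdict
from datetime import datetime

# sample rows inlined for illustration; in practice iterate over open("test.csv")
SAMPLE = """\
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56
"""

# {(month, region): Counter({brand: count})} -- defaultdict creates the inner
# Counter on first access, so there is no explicit nested-dict updating
counts = defaultdict(Counter)
for line in io.StringIO(SAMPLE):
    fields = line.strip().split(",")
    month = datetime.strptime(fields[0], "%m/%d/%Y").strftime("%m")
    counts[(month, fields[1])].update(x.split("_")[0] for x in fields[2:])

for (month, region), brands in sorted(counts.items()):
    print(month, region, brands["BMW"], brands["TESLA"])
```

This prints the same counts as the expected output above (e.g. `11 Region2 3 7`).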


1 Answer


Here's one approach. I wouldn't call it pretty, but it gets the job done.

  1. The first problem is the variable number of fields in each row. You can read the file one line at a time, store each row in a list, data, and trim the car IDs down to just the brand as you go:

    import pandas as pd
    
    # assuming the CSV is named test.csv
    data = []
    with open("test.csv") as f:
        for line in f:
            fields = line.strip().split(",")
            # "TESLA_332" -> "TESLA"; dates and regions have no "_" and pass through
            just_brand = [x.split("_")[0] for x in fields]
            data.append(just_brand)
    
  2. Now that the file is read into a Python data structure, we can reorder the rows so that the entry with the greatest number of fields comes first. Pandas handles missing trailing fields gracefully, padding shorter rows with None, so starting from the widest row means every subsequent row fits.

    df = pd.DataFrame(sorted(data, key=lambda row: len(row), reverse=True))
    
    df
                0        1      2      3      4      5      6      7      8      9
    0  10/14/2017  Region2    BMW    BMW    BMW  TESLA  TESLA  TESLA  TESLA  TESLA
    1  11/07/2017  Region2  TESLA    BMW  TESLA  TESLA   None   None   None   None
    2  11/02/2017  Region2  TESLA    BMW  TESLA  TESLA   None   None   None   None
    3  12/01/2017  Region1    BMW  TESLA  TESLA   None   None   None   None   None
    4  10/23/2017  Region1    BMW  TESLA  TESLA   None   None   None   None   None
    5  11/19/2017  Region2    BMW  TESLA   None   None   None   None   None   None
    
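  (Aside: if you'd rather skip the sorting step, pd.read_csv can ingest the ragged file directly when you pass explicit column names wide enough for the longest row. A sketch, with two sample rows inlined via StringIO and the column count of 10 assumed known in advance:)

```python
import io
import pandas as pd

# sample rows inlined; in practice pass the filename instead of a StringIO
SAMPLE = """\
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/19/2017,Region2,BMW_123,TESLA_31
"""

# With header=None alone, read_csv fails if a later row is wider than the
# first; explicit names sized for the widest row sidestep the sorting trick.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=range(10))

# trim "TESLA_332" -> "TESLA"; non-strings (the NaN padding) pass through
df = df.apply(lambda col: col.map(lambda v: v.split("_")[0] if isinstance(v, str) else v))
```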
  3. From here, it's just a matter of getting things named, organized, and formatted.

    df = (df.set_index([0,1])
            .stack()
            .reset_index(level=1)
            .rename(columns={1:"region",0:"make"})
            .reset_index(level=1, drop=True))
    
    df = (df.groupby([pd.to_datetime(df.index).month,"region","make"])
            .make.count()
            .unstack()
            .reset_index()
            .rename(columns={0:"month"}))
    df.columns.name = ""
    
    df
        region  BMW  TESLA  month
    0  Region1    1      2     10
    1  Region2    3      5     10
    2  Region2    3      7     11
    3  Region1    1      2     12
    
  4. (alternate, less gymnastic)

    # get TESLA, BMW counts for each row
    cts = df.iloc[:,2:].apply(lambda x: x.value_counts(), axis=1)
    # merge with date, region
    df2 = pd.concat([df.iloc[:, :2], cts], axis=1)
    # group by month and region, then sum the per-row counts
    (df2.groupby([pd.to_datetime(df2[0]).dt.month, 1])
        .sum(numeric_only=True)  # drops the raw date strings from the sum
        .reset_index()
        .rename(columns={0: "month", 1: "region"}))
    
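  (For completeness — this variant is not part of the original answer — the whole pipeline can also be collapsed into a long-format table plus pd.crosstab, which does the month/region/make counting in one call. A sketch with the sample rows inlined as a string:)

```python
import io
import pandas as pd

SAMPLE = """\
12/01/2017,Region1,BMW_123,TESLA_332,TESLA_2002
11/07/2017,Region2,TESLA_332,BMW_123,TESLA_2002,TESLA_99812
11/19/2017,Region2,BMW_123,TESLA_31
10/23/2017,Region1,BMW_4,TESLA_3,TESLA_90
11/02/2017,Region2,TESLA_28,BMW_56,TESLA_22,TESLA_821
10/14/2017,Region2,BMW_1,BMW_8,BMW_2,TESLA_3,TESLA_4,TESLA_99,TESLA_81,TESLA_56
"""

# build one (month, region, make) row per observed ID
rows = [line.split(",") for line in SAMPLE.strip().splitlines()]
long_df = pd.DataFrame(
    [(pd.to_datetime(r[0]).month, r[1], x.split("_")[0]) for r in rows for x in r[2:]],
    columns=["month", "region", "make"],
)

# cross-tabulate: one row per (month, region), one column per make
result = pd.crosstab([long_df["month"], long_df["region"]], long_df["make"]).reset_index()
```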

2 Comments

Wouldn't it be better if you made two int columns, "TESLA" and "BMW", and store how many times each one occurred in each line? This makes further processing much easier.
Qusai Alothman, I added a solution based on your suggestion. Thanks!
