How to extract data from a column with variable number of values in each row in a CSV file?

Question

Basically I'm trying to add together the count column values of item columns with the same name in a csv file. I then need to sort the results in ascending alphabetic order by item column values. For example:

Leading Cause, Deaths
Diabetes Mellitus, 123
Influenza and Pneumonia, 325
Diabetes Mellitus, 100

I need to add the values 123 and 100 to get a new row for Diabetes.

It should look like this:
Diabetes Mellitus, 223.

This is the code I have so far:

import csv
import sys

with open(sys.argv[1], 'r') as File:
    reader = csv.reader(File)
    itemindex = sys.argv[2]
    countindex = sys.argv[3]
    item column = 0
    count column = 0
    first row = True
    dictionary = {}

    for row in reader:
       if firstrow == True:
          firstrow = False
          itemcolumn = row.index(itemindex)
          countcolumn = row.index(countindex)
       else:
           if item column in dictionary:
               # Add the item at the row's count column (converted to an int) to the
               # prexisting entry with that item column.
           else:
               #create a new entry in the dictionary

       print(itemindex + "," + countindex)

for key, value in sorted(dictionary)
    print(key + "," + value)

The commented parts are the ones I'm stuck on.

akozi · Accepted Answer · 2018-03-14 02:33:39Z

1

If you have access you could use the pandas package to handle the csv.

Text file titled values.txt

Leading Cause, Deaths
Diabetes Mellitus, 123
Influenza and Pneumonia, 325
Diabetes Mellitus, 1008

The desired data frame can be achieved with:

import pandas as pd

data = pd.read_csv('values.txt')
print(data)

sum_data = data.groupby(['Leading Cause']).sum()
print(sum_data)

print(sum_data.loc['Diabetes Mellitus'])

Which would output

             Leading Cause   Deaths
0        Diabetes Mellitus      123
1  Influenza and Pneumonia      325
2        Diabetes Mellitus     1008

                          Deaths
Leading Cause                   
Diabetes Mellitus           1131
Influenza and Pneumonia      325

 Deaths    1131
Name: Diabetes Mellitus, dtype: int64

edited Mar 14, 2018 at 2:33

answered Mar 13, 2018 at 20:47

akozi

4551 gold badge8 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Cedric Zoppolo Over a year ago

pandas is a great package to manage data within csv files, and even for excel files.

martineau · Accepted Answer · 2018-03-14 01:53:49Z

Using a collections.defaultdict, which is a specialized dictionary class, would make it easy:

import collections
import csv
import os
import sys

try:
    filename = sys.argv[1]
    itemindex = int(sys.argv[2])
    countindex = int(sys.argv[3])
except IndexError:
    print('Error:\n  Usage: {} <file name> <item index> <count index>'.format(
            os.path.basename(sys.argv[0])))
    sys.exit(-1)

with open(filename, 'r', newline='') as file:
    reader = csv.reader(file, skipinitialspace=True)
    next(reader)  # Skip first row.

    counter = collections.defaultdict(int)
    for row in reader:
        disease, deaths = row[itemindex], int(row[countindex])
        counter[disease] += deaths

for key, value in sorted(counter.items()):
    print('{}, {}'.format(key, value))

Sample usage:

python3 script_name.py diseases.csv 0 1

Sample output:

Diabetes Mellitus, 223
Influenza and Pneumonia, 325

M T Head · Accepted Answer · 2018-03-13 20:44:16Z

0

You could try not using the libraries provided and just loop through the data as text. Parsing the text.

answered Mar 13, 2018 at 20:44

M T Head

1,3201 gold badge10 silver badges14 bronze badges

Collectives™ on Stack Overflow

How to extract data from a column with variable number of values in each row in a CSV file?

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related