
I'm trying to extract data from a CSV file that contains some missing data:

Num,Sym,Element,Group,Weight,Density,Melting,Boiling,Heat,Eneg,Radius,Oxidation
1,H,Hydrogen,1,1.008,0.00008988,14.01,20.28,14.304,2.2,53,"[1,-1]"
2,He,Helium,18,4.002602,0.0001785,0.956,4.22,5.193,No_Data,31,[0]
etc

In this case the missing value is the electronegativity of Helium, a noble gas. I also want to parse this data all at once (i.e. as I read it in) and cast each field to the appropriate data type so I can perform calculations as needed, using this function:

import csv

def read_periodic_table():
    per_table = {}
    with open("element_list.csv", "r", newline="") as f:
        my_reader = csv.reader(f)
        next(my_reader)  # Just skipping the header
        try:
            while True:
                tl = next(my_reader)
                per_table[tl[1]] = (int(tl[0]), tl[2], int(tl[3]), float(tl[4]),
                                    float(tl[5]), float(tl[6]), float(tl[7]),
                                    float(tl[8]), float(tl[9]), float(tl[10]),
                                    list(tl[11]))
        except StopIteration:
            return per_table

This works fine, except where there is no data (as above), where I get a ValueError. I get why there is an error - you can't really cast "No_Data" to a floating-point number.
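For reference, the exception raised when casting a non-numeric string to float is `ValueError` (a `TypeError` would only occur for a wrong argument type, e.g. `float(None)`). A quick check, using the same "No_Data" sentinel:

```python
# float() on a non-numeric string raises ValueError, not TypeError.
try:
    float("No_Data")
except (TypeError, ValueError) as exc:
    print(type(exc).__name__)  # → ValueError
```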

I've read some related questions that could probably answer my question, except I'd like to avoid using extra libraries for just one function.

The only way I can think of to handle this is a lot of try/except blocks, something like this:

num = tl[0]
name = tl[2]
group = tl[3]
try:
    weight = float(tl[4])
except ValueError:
    weight = "No_Data"
finally:
    try:
        density = float(tl[5])
        except ValueError:
            density = "No_Data"
    finally:
        try:
            ...

Which, for what I hope are obvious reasons, I'd rather avoid. Is there a way to accomplish this using only the standard library? If the answer is "No, not very easily/well," that's fine; I'll just use numpy/pandas. I'd just like to avoid that if possible. Alternately, if there is a fantastic numpy/pandas answer and a compelling reason why using an extra library wouldn't be bad, I'd take that too.

The reason I don't want to use a third party library is that several people, including myself, will be working on this and then quite a few people will be using it afterwards. I'd rather not make them all install another library to make this work.

2 Answers


If I was absolutely determined to not use pandas, I'd do something like this:

  • Specify the type for each column
  • Write a quick conversion function to try out each conversion
  • Use a list comp/generator expression to call the conversion function on each cell

def convert_type(cell, typ):
    """Try the conversion; fall back to the sentinel on failure."""
    try:
        return typ(cell)
    except (ValueError, TypeError):
        return "No_Data"

# These lines go below the line that reads the next row ('tl = ...') in your code
col_types = [int, str, str, int, float, float, float, float, float, float, float, list]
new_row = tuple(convert_type(cell, typ) for cell, typ in zip(tl, col_types))
per_table[tl[1]] = new_row

That said, if I was doing this myself, I would definitely use pandas. A distribution like Anaconda is a good option for getting Python set up quickly with lots of useful libraries like pandas already included.
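Put together, the whole reader might look like the sketch below. It assumes the same column layout and "No_Data" sentinel as the question, and leaves the Oxidation column as a raw string for simplicity; the demo feeds the sample rows from the question through `io.StringIO` instead of a file on disk:

```python
import csv
import io

def convert_type(cell, typ):
    """Try the conversion; fall back to the sentinel on failure."""
    try:
        return typ(cell)
    except (ValueError, TypeError):
        return "No_Data"

# Column types for Num, Sym, Element, Group, Weight, Density, Melting,
# Boiling, Heat, Eneg, Radius, Oxidation (Oxidation kept as a raw string).
COL_TYPES = [int, str, str, int, float, float, float,
             float, float, float, float, str]

def read_periodic_table(f):
    per_table = {}
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for tl in reader:
        per_table[tl[1]] = tuple(convert_type(c, t)
                                 for c, t in zip(tl, COL_TYPES))
    return per_table

# Demo with the sample rows from the question.
sample = io.StringIO(
    "Num,Sym,Element,Group,Weight,Density,Melting,Boiling,Heat,Eneg,Radius,Oxidation\n"
    '1,H,Hydrogen,1,1.008,0.00008988,14.01,20.28,14.304,2.2,53,"[1,-1]"\n'
    "2,He,Helium,18,4.002602,0.0001785,0.956,4.22,5.193,No_Data,31,[0]\n"
)
table = read_periodic_table(sample)
print(table["He"][9])  # → No_Data  (the missing electronegativity)
```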


3 Comments

Sorry, just noticed that for your last column you'd have to replace list with a custom function that converts the string representation of the list into an actual Python list. The list() call won't do it; you'd just get a list of individual characters for each row.
Won't this also pass in the Sym value? I intended to exclude that because I'm using it as the dict key
Yes, sorry, it will pass in the Sym/tl[1] value. You could slice up new_row before putting it in the table, or find some other way of mapping between column positions and conversion functions, like a dict. The main idea I was trying to get across was having some kind of mapping that identifies the type of each column.
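Building on those two comments, here is one stdlib-only sketch: a dict maps each column index to a converter, the Sym column is simply left out (it stays the dict key), and `ast.literal_eval` parses the oxidation-state string like "[1,-1]" into a real list. The sample row is hard-coded here for illustration:

```python
import ast

def convert_type(cell, typ):
    """Try the conversion; fall back to the sentinel on failure."""
    try:
        return typ(cell)
    except (ValueError, TypeError, SyntaxError):
        return "No_Data"

# Converter per column index; ast.literal_eval safely parses "[1,-1]".
# Column 1 (Sym) is deliberately absent -- it is used as the dict key.
col_types = {0: int, 2: str, 3: int, 4: float, 5: float, 6: float,
             7: float, 8: float, 9: float, 10: float, 11: ast.literal_eval}

# The Helium row from the question, as the csv reader would deliver it.
tl = ['2', 'He', 'Helium', '18', '4.002602', '0.0001785', '0.956',
      '4.22', '5.193', 'No_Data', '31', '[0]']

new_row = tuple(convert_type(tl[i], typ) for i, typ in col_types.items())
print(new_row)
# → (2, 'Helium', 18, 4.002602, 0.0001785, 0.956, 4.22, 5.193, 'No_Data', 31, [0])
```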

I think the best way to import text data with missing values into Python is numpy's genfromtxt function. It is very easy to use. In my case missing values are indicated by '?'; in yours you would use 'No_Data'.

import numpy as np

train = np.genfromtxt(path + 'cleveland.data', dtype=float, delimiter=',',
                      missing_values='?', filling_values=np.nan)

2 Comments

Like I said, I'm trying to avoid using numpy
Sorry, I was reading multiple questions about missing values in Python and didn't read your question carefully.
