
I'm trying to extract data from a CSV file that contains some missing data:

Num,Sym,Element,Group,Weight,Density,Melting,Boiling,Heat,Eneg,Radius,Oxidation
1,H,Hydrogen,1,1.008,0.00008988,14.01,20.28,14.304,2.2,53,"[1,-1]"
2,He,Helium,18,4.002602,0.0001785,0.956,4.22,5.193,No_Data,31,[0]
etc

In this case the missing value is the electronegativity of Helium, a noble gas. I also want to parse this data all at once (i.e. as I read it in) and cast each field to the appropriate data type so I can perform calculations as needed, using this function:

import csv

def read_periodic_table():
    per_table = {}
    with open("element_list.csv", "r", newline="") as f:
        my_reader = csv.reader(f)
        next(my_reader)  # Just skipping the header
        try:
            while True:
                tl = next(my_reader)
                per_table[tl[1]] = (int(tl[0]), tl[2], int(tl[3]), float(tl[4]),
                                    float(tl[5]), float(tl[6]), float(tl[7]),
                                    float(tl[8]), float(tl[9]), float(tl[10]),
                                    list(tl[11]))
        except StopIteration:
            return per_table

This works fine, except where there is no data (as above), where I get a ValueError. I get why there is an error - you can't really cast "No_Data" to a floating-point number.
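For reference, the exception raised when casting a non-numeric string to float is `ValueError` (a `TypeError` would only occur for a wrong argument type, e.g. `float(None)`). A quick check, using the same "No_Data" sentinel:

```python
# float() on a non-numeric string raises ValueError, not TypeError.
try:
    float("No_Data")
except (TypeError, ValueError) as exc:
    print(type(exc).__name__)  # → ValueError
```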

I've read some related questions that could probably answer my question, except I'd like to avoid using extra libraries for just one function.

The only way I can think of to handle this is a lot of try/except blocks, something like this:

num = tl[0]
name = tl[2]
group = tl[3]
try:
    weight = float(tl[4])
except ValueError:
    weight = "No_Data"
finally:
    try:
        density = float(tl[5])
        except ValueError:
            density = "No_Data"
    finally:
        try:
            ...

Which, for what I hope are obvious reasons, I'd rather avoid. Is there a way to accomplish this using only the standard library? If the answer is "No, not very easily/well," that's fine; I'll just use numpy/pandas. I'd just like to avoid that if possible. Alternately, if there is a fantastic numpy/pandas answer and a compelling reason why using an extra library wouldn't be bad, I'd take that too.

The reason I don't want to use a third party library is that several people, including myself, will be working on this and then quite a few people will be using it afterwards. I'd rather not make them all install another library to make this work.

2 Answers


If I was absolutely determined to not use pandas, I'd do something like this:

  • Specify the type for each column
  • Write a quick conversion function to try out each conversion
  • Use a list comp/generator expression to call the conversion function on each cell

def convert_type(cell, typ):
    """Try the conversion; fall back to the sentinel on failure."""
    try:
        return typ(cell)
    except (ValueError, TypeError):
        return "No_Data"

# These lines go below the line that reads the next row ('tl = ...') in your code
col_types = [int, str, str, int, float, float, float, float, float, float, float, list]
new_row = tuple(convert_type(cell, typ) for cell, typ in zip(tl, col_types))
per_table[tl[1]] = new_row

That said, if I was doing this myself, I would definitely use pandas. A distribution like Anaconda is a good option for getting Python set up quickly with lots of useful libraries like pandas already included.
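Put together, the whole reader might look like the sketch below. It assumes the same column layout and "No_Data" sentinel as the question, and leaves the Oxidation column as a raw string for simplicity; the demo feeds the sample rows from the question through `io.StringIO` instead of a file on disk:

```python
import csv
import io

def convert_type(cell, typ):
    """Try the conversion; fall back to the sentinel on failure."""
    try:
        return typ(cell)
    except (ValueError, TypeError):
        return "No_Data"

# Column types for Num, Sym, Element, Group, Weight, Density, Melting,
# Boiling, Heat, Eneg, Radius, Oxidation (Oxidation kept as a raw string).
COL_TYPES = [int, str, str, int, float, float, float,
             float, float, float, float, str]

def read_periodic_table(f):
    per_table = {}
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for tl in reader:
        per_table[tl[1]] = tuple(convert_type(c, t)
                                 for c, t in zip(tl, COL_TYPES))
    return per_table

# Demo with the sample rows from the question.
sample = io.StringIO(
    "Num,Sym,Element,Group,Weight,Density,Melting,Boiling,Heat,Eneg,Radius,Oxidation\n"
    '1,H,Hydrogen,1,1.008,0.00008988,14.01,20.28,14.304,2.2,53,"[1,-1]"\n'
    "2,He,Helium,18,4.002602,0.0001785,0.956,4.22,5.193,No_Data,31,[0]\n"
)
table = read_periodic_table(sample)
print(table["He"][9])  # → No_Data  (the missing electronegativity)
```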


3 Comments

Sorry, just noticed that for your last column you'd have to replace list with a custom function that converts the string representation of the list into an actual Python list. The list() call won't do it; you'd just get a list of individual characters for each row.
Won't this also pass in the Sym value? I intended to exclude that because I'm using it as the dict key
Yes, sorry, it will pass in the Sym/tl[1] value. You could slice up new_row before putting it in the table, or find some other way of mapping between column positions and conversion functions, like a dict. The main idea I was trying to get across was having some kind of mapping that identifies the type of each column.
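Building on those two comments, here is one stdlib-only sketch: a dict maps each column index to a converter, the Sym column is simply left out (it stays the dict key), and `ast.literal_eval` parses the oxidation-state string like "[1,-1]" into a real list. The sample row is hard-coded here for illustration:

```python
import ast

def convert_type(cell, typ):
    """Try the conversion; fall back to the sentinel on failure."""
    try:
        return typ(cell)
    except (ValueError, TypeError, SyntaxError):
        return "No_Data"

# Converter per column index; ast.literal_eval safely parses "[1,-1]".
# Column 1 (Sym) is deliberately absent -- it is used as the dict key.
col_types = {0: int, 2: str, 3: int, 4: float, 5: float, 6: float,
             7: float, 8: float, 9: float, 10: float, 11: ast.literal_eval}

# The Helium row from the question, as the csv reader would deliver it.
tl = ['2', 'He', 'Helium', '18', '4.002602', '0.0001785', '0.956',
      '4.22', '5.193', 'No_Data', '31', '[0]']

new_row = tuple(convert_type(tl[i], typ) for i, typ in col_types.items())
print(new_row)
# → (2, 'Helium', 18, 4.002602, 0.0001785, 0.956, 4.22, 5.193, 'No_Data', 31, [0])
```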

I think the best way to import text data with missing values into Python is numpy's genfromtxt function. It is very easy to use. In my case missing values are indicated by '?'; in yours you would use 'No_Data'.

import numpy as np

train = np.genfromtxt(path + 'cleveland.data', dtype=float, delimiter=',',
                      missing_values='?', filling_values=np.nan)

2 Comments

Like I said, I'm trying to avoid using numpy
Sorry, I was reading multiple questions about missing values in Python and didn't read your question carefully.
