
I'm trying to extract anomalous data points from a large CSV file (~1e6 lines) in which most of the data points sit at a constant value. I've written the code below to detect values lower than that constant.

constant = 1
try:
    fp = open('disk2.csv')
    for line in fp: 
        ch4 = float(line.split(",")[4]) #data from channel four is in the fifth column
        if ch4 < constant:
            print line.split(",")[0] #print first column

except:
    ch4 = 'Not found'
finally:
    fp.close()
    print(ch4,type(ch4))

The print returns the following, without additional errors:

('Not found', <type 'str'>)

If I change the code to:

constant = 1
try:
    fp = open('disk2.csv')
    for line in fp: 
        ch4 = line.split(",")[4] #data from channel four is in the fifth column
        if ch4 < constant:
            print line.split(",")[0] #print first column

except:
    ch4 = 'Not found'
finally:
    fp.close()
    print(ch4,type(ch4))

It returns

(' 2.41650E+01', <type 'str'>)

So the CSV file is read as a string, and the string can be split into a list using split, but I cannot convert the items in the list to floating-point numbers?

The error was not in the code but in my CSV file: the first row did not contain enough items.
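A minimal sketch of how a short first row produces this symptom, and one way to surface it instead of hiding it behind a bare except. It is written for Python 3 and uses the standard csv module; the function name find_low_values, the in-memory sample data, and the choice to skip malformed rows rather than stop are assumptions for illustration, not part of the original code.

```python
import csv
import io


def find_low_values(lines, constant, column=4):
    """Return (hits, skipped) for rows whose `column` value is below `constant`.

    Rows that are too short or non-numeric are recorded in `skipped`
    instead of aborting the whole loop.
    """
    hits, skipped = [], []
    for line_no, row in enumerate(csv.reader(lines), start=1):
        try:
            value = float(row[column])
        except (IndexError, ValueError) as exc:
            # IndexError: the row has fewer than column + 1 fields
            # ValueError: the field is not a number
            skipped.append((line_no, type(exc).__name__))
            continue
        if value < constant:
            hits.append((row[0].strip(), value))
    return hits, skipped


# Hypothetical sample: a short first row followed by two data rows.
sample = io.StringIO(
    "disk2 log\n"
    "1.0E+00, 2.4E+01, 2.4E+01, 2.4E+01, 2.4E+01\n"
    "2.0E+00, 2.4E+01, 2.4E+01, 2.4E+01, 5.0E-01\n"
)
hits, skipped = find_low_values(sample, constant=1)
print(hits)
print(skipped)
```

Catching only IndexError and ValueError keeps other problems visible, and the recorded line numbers point straight at the rows that could not be parsed. For a real file, an open file object can be passed in place of the sample.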

  • You can change the string into a float using float_value = float(ch4) Commented Oct 26, 2018 at 12:54
  • This doesn't directly answer your question so I'm not including it as an answer, but you might take a look at using the pandas library if you'll be working much with csv data. This could be done in 2 lines with the first being reading the file into a DataFrame and the second showing all rows with value less than constant. Commented Oct 26, 2018 at 18:11
  • Would it also work for a CSV file with millions of lines? Commented Oct 26, 2018 at 19:27
  • pandas would load the whole thing into memory, so as long as you have enough memory you should be fine. There's another library called dask that uses the pandas API but allows for using data sets that don't fit into memory, but I've never used it myself. Commented Oct 26, 2018 at 19:51

2 Answers


It's generally bad practice to compare floats directly for equality. It's better to use something like this:

abs(float(ch4) - constant) <= allowed_error

Here allowed_error is some small value such as 0.000001. Floating-point numbers are stored in binary, so many decimal values cannot be represented exactly: a computation whose exact result is 1.0 (for example, adding 0.1 ten times) can be stored as 0.9999999999999999.
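A short Python 3 sketch of the tolerance check described here. The value of allowed_error is only illustrative, and math.isclose is shown as a standard-library alternative that the answer itself does not mention. Note that an ordered comparison such as the question's ch4 < constant is only affected when a value lies within rounding error of the threshold.

```python
import math

allowed_error = 1e-6  # illustrative tolerance; tune it to the data

total = 0.1 + 0.2  # not stored as exactly 0.3

print(total == 0.3)                            # exact equality fails
print(abs(total - 0.3) <= allowed_error)       # absolute-tolerance check
print(math.isclose(total, 0.3, rel_tol=1e-9))  # relative-tolerance check
```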


8 Comments

In case where they need to be ranked, is there some way to generate the value of allowed_error for 16 decimal places, for example?
You can use sys.float_info.epsilon for that, I believe.
This looks like a way to find small differences between numbers. I'm actually looking for a bigger difference, but the allowed_error value could be used to tune that. Anyway, I changed line 6 to abs(float(ch4),constant) <= allowed_error with allowed_error set to 0.1 and it generated the same result (i.e. the for loop is failing)
Can you show a sample of your file? A couple of lines? Can you temporarily change constant to something that's definitely bigger than some values in that sample and try again?
Most lines look like this 1.320460000E+04, 2.41900E+01, 2.41900E+01, 2.41900E+01, 2.41900E+01, 2.41900E+01, 2.50000E-02, 2.00000E-02, 2.40000E-01, 1.00000E-02,-2.36750E+01,0, \n

In the first case, you are doing the comparison with the values, and changing the format from str to float for the comparison, as in if float(ch4) < constant. Note that you are not storing the value as a float type, but just converting it right there for this particular evaluation.

In the second case, you are comparing a str with an int. Notice that with constant = 1, the type of constant is int, not float. Neither value is converted for the comparison: in Python 2 (CPython), numbers always sort before strings, so ch4 < constant is always False and nothing is printed in the loop. In Python 3 the same comparison raises a TypeError.

To solve your problem, you must store the value in ch4 as a float. This can be done by ch4 = float(line.split(",")[4]) which will store the value in a float variable, as opposed to the str variable.
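A small Python 3 sketch of the conversion step, using a row in the format quoted in the comments (shortened here to six fields, which is an assumption made for brevity). It also shows that Python 3 refuses to compare the unconverted string with a number.

```python
# One row in the format quoted in the comments, shortened to six fields.
line = ("1.320460000E+04, 2.41900E+01, 2.41900E+01, "
        "2.41900E+01, 2.41900E+01, 2.41900E+01\n")

raw = line.split(",")[4]  # still a str, with a leading space
ch4 = float(raw)          # float() accepts surrounding whitespace

print(type(raw).__name__, type(ch4).__name__)
print(ch4 < 1, ch4 < 25)

# In Python 3, comparing the unconverted string with an int raises TypeError.
try:
    raw < 1
    comparison_failed = False
except TypeError:
    comparison_failed = True
print(comparison_failed)
```

With constant = 1 this row is not reported, because 24.19 is not below the threshold. Raising the threshold to 25 makes it match.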

6 Comments

If I change line 5 to ch4 = float(line.split(",")[4] then I get the same result as in the first example. Do you have an example of how it would work?
You need to close the parentheses properly. Are you sure it's not a typo? Inside the block, you can add print(type(ch4)) to verify that it works.
when I request the type inside the for loop, it works when the code is ch4 = line.split(",")[4] but not when the code is ch4 = float(line.split(",")[4]) (indeed above there was a typo, but that was not in the real code.)
Using the numbers you provided under the other answer, when I evaluate them with ch4 = float(blah), type(ch4) is displayed as float. The condition requires the number to be less than 1, so nothing is printed in the loop and only the output of the finally block appears. The number being evaluated is 2.41900E+01. If I change the value of constant to 25, it displays the first column as a string (we didn't typecast it), followed by the final value of ch4 and its type. Isn't that how it's supposed to work?
You are right and it turned out I had missed that the top row of my file was unsuited for this way of reading out...