
The following code is part of a bigger project. In my project I have to read a large text file, likely with many millions of lines, each line containing a pair of decimal numbers separated by a space.

An example is the following:

-0.200000 -1.000000
-0.469967 0.249733
-0.475169 -0.314739
-0.086706 -0.901599

Until now I used a custom-made parser, which worked fine but was not the fastest. Searching online I found NumPy's loadtxt and pandas' read_csv. The first one worked great, but its speed was even worse than my parser's. The second one was pretty fast, but I was getting errors later in my project (I solve some PDEs with the finite element method; when reading the coordinates with either my parser or loadtxt I get the correct result, but when I use read_csv the matrix A of the system Ax=b becomes singular).

So I created this test code to see what's going on:

import numpy as np
import pandas as pd

points_file = './points.txt'

# parse the same file with pandas' C parser and with numpy's parser
points1 = pd.read_csv(points_file, header=None, sep=r'\s+', dtype=np.float64).values
points2 = np.loadtxt(points_file, dtype=np.float64)

if np.array_equal(points1, points2):
    print('Equal')
else:
    print('Not Equal')

for i in range(len(points1)):
    print(points1[i] == points2[i])

Surprisingly the output was:

Not Equal
[ True  True]
[ True False]
[False  True]
[False False]

Already quite confused, I continued searching and found this function from user "Dan Lecocq" for getting the binary representation of numbers.

So, for the second number on the second line (0.249733), the binary representations from read_csv and loadtxt were, respectively:

0011111111001111111101110100000000111101110111011011000100100000
0011111111001111111101110100000000111101110111011011000100100001

and the decimal values:

2.49732999999999982776444085175E-1
2.49733000000000010532019700804E-1

Why is this happening? I mean, I read the same string from a text file and save it in memory as the same data type. I would also love to understand why this small difference affects my solution so much, but that would involve showing you around 1000 lines of my messy code; I first need to create more test code to pinpoint the problem.
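The one-ULP gap above can be reproduced without the linked helper, using only the standard `struct` module (a minimal sketch; `float_bits` is a name of my own, and the second value is constructed by hand to mimic what read_csv returned here):

```python
import struct

def float_bits(x):
    """Return the 64-bit IEEE 754 pattern of a float as a bit string."""
    (n,) = struct.unpack('>Q', struct.pack('>d', x))
    return format(n, '064b')

a = float('0.249733')   # Python's correctly rounded parse (what loadtxt stored)
b = a - 2.0**-55        # one ULP below: the value read_csv stored in this case

print(float_bits(a))
print(float_bits(b))
print(abs(a - b) == 2.0**-55)   # the gap is exactly one ULP in [0.125, 0.25)
```

Since 0.249733 lies in [0.125, 0.25), its ULP is 2^-55, so subtracting 2.0**-55 flips only the last mantissa bit, matching the two bit strings shown above.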

Software versions:

Ubuntu 16.04 64bit
Python: 2.7.12
Numpy: 1.11.0
Pandas: 0.18.0
4 Comments

  • Pandas has its own decimal-float parsing functions for the sake of speed. They do not always give the most accurate floating-point representation of the decimal input. Commented Jul 18, 2016 at 21:18
  • We are always telling new programmers: don't worry about those extra digits at the end. The floating-point representation of 0.249733 is inherently imprecise. The difference between those two numbers is 2**-55, and np.allclose returns True. Commented Jul 18, 2016 at 21:57
  • Seems like a fair question from someone who wants to understand: "Why is this happening?" Commented Aug 1, 2021 at 17:17
  • It's worth noting that this no longer happens in Python 3 with current versions of numpy and pandas. Commented Feb 10, 2022 at 23:04
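As the first comment notes, pandas trades parsing accuracy for speed. In recent pandas versions, `read_csv` accepts a `float_precision` argument that switches the C engine to a slower round-trip parser (a minimal sketch; the inline data mirrors the question's second line, and a plain single-space `sep` is used so the C engine applies):

```python
import pandas as pd
from io import StringIO

data = '-0.469967 0.249733\n'

# round-trip parser: guaranteed to reproduce Python's own float() result
precise = pd.read_csv(StringIO(data), header=None, sep=' ',
                      float_precision='round_trip')

print(precise.iloc[0, 1] == float('0.249733'))   # True
```

With `float_precision='round_trip'`, the discrepancy between read_csv and loadtxt in the question disappears.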

1 Answer

I would ask myself the following question: How much precision is needed in my project?

I would suggest using pandas' or NumPy's round(), if you can afford to lose some digits.

Keep in mind that floating-point processing is always finicky. A useful resource is correcting for floating point arithmetic 'errors' when rounding in pandas, or, if you know nothing about float representation: https://docs.python.org/3/tutorial/floatingpoint.html
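In practice, a tolerance-based comparison is often all that is needed. A small sketch (the values are made up to mirror the question's one-ULP gap):

```python
import numpy as np

a = np.array([0.249733])   # correctly rounded parse
b = a - 2.0**-55           # one ULP lower, like read_csv's value in the question

print(np.array_equal(a, b))        # False: bit-for-bit comparison fails
print(np.allclose(a, b))           # True: equal within default tolerances
print(np.array_equal(np.round(a, 6), np.round(b, 6)))  # True after rounding
```

Either `np.allclose` for comparisons or rounding both arrays to the number of decimals actually present in the file sidesteps the last-bit difference.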


1 Comment

Please support your solution by showing the representations the code actually produces.
