U_d = pd.read_csv("Uzud.csv", decimal=',')
print(U_d)

What comes out:

     f in kHz  U in mV  Amplitude in Vpp  entfernung in cm
43.7       52        1               5.0               9.0
43.7       43        5              11.0               NaN
43.7       30        5              13.0               NaN
43.7     27,5        5              15.0               NaN
43.7       24        5              16.0               NaN

(Note that the value 27.5 in 'U in mV' comes out as the string 27,5.)

All of the data in all columns are actually floats, but unfortunately the file is stored like this:

f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",52,1,5,9
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16

which causes some data in 'U in mV' to be interpreted as a string. I want everything to be stored as a float.
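For reference, the problem can be reproduced from the file contents quoted above (a sketch using StringIO in place of the actual file):

```python
import pandas as pd
from io import StringIO

# the raw file contents quoted above
raw = '''f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",52,1,5,9
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16
'''

U_d = pd.read_csv(StringIO(raw), decimal=',')
# expect at least one 'object' column here instead of float64
print(U_d.dtypes)
```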

2 Answers


Use pd.read_csv with skipinitialspace=True. Add dtype=float if integer columns should become floats as well. For correcting the "bad lines", see below.

import pandas as pd
from io import StringIO

s = """f in kHz,U in mV,amplitude in Vpp,distance in cm
"43,7",52,1,5,9
"43,7",43,5,11
"43,7",30,5,13
"43,7", "27,5",5,15
"43,7",24,5,16
"""

df = pd.read_csv(StringIO(s), decimal=',', skipinitialspace=True, dtype=float)

Output

      f in kHz  U in mV  amplitude in Vpp  distance in cm
43.7      52.0      1.0               5.0             9.0
43.7      43.0      5.0              11.0             NaN
43.7      30.0      5.0              13.0             NaN
43.7      27.5      5.0              15.0             NaN
43.7      24.0      5.0              16.0             NaN

As noted in the answer by @AutumnKome, your first line appears to be flawed:

"43,7",52,1,5,9`
# should probably be 
"43,7","52,1",5,9

Since pandas 1.4.0 you can handle such cases by passing a callable to on_bad_lines.

It's a bit annoying that the first error appears in the first non-header row, leading to the issue discussed in this SO post, with a ticket submitted here.

As a workaround, use header=None, which will push the headers to row 1, and correct it afterwards.

def handle_bad_lines(row):
    new_row = row[:1] + [float('.'.join(row[1:3]))] + row[3:]
    # ['43.7', '52', '1', '5', '9'] ->
    # ['43.7', 52.1, '5', '9']
    return new_row

df = pd.read_csv(StringIO(s), decimal=',', 
                 skipinitialspace=True, 
                 header=None,
                 on_bad_lines=handle_bad_lines, 
                 engine='python') # with `pd 2.2.0`, use `engine='pyarrow'`

df.head(3)

          0        1                 2               3
0  f in kHz  U in mV  amplitude in Vpp  distance in cm # <header pushed to row 1
1      43.7     52.1                 5               9
2      43.7       43                 5              11

Correcting the header, index, and adjusting the dtype (using df.T, df.set_index, df.rename_axis, df.reset_index and df.astype).

df = (
    df.T
    .set_index(0)
    .T
    .rename_axis(columns=None)
    .reset_index(drop=True)
    .astype(float)
)

Output (bad lines corrected, header restored)

   f in kHz  U in mV  amplitude in Vpp  distance in cm
0      43.7     52.1               5.0             9.0
1      43.7     43.0               5.0            11.0
2      43.7     30.0               5.0            13.0
3      43.7     27.5               5.0            15.0
4      43.7     24.0               5.0            16.0
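The same header restoration can also be sketched more directly by promoting the first row in place (a variant of the transpose chain above, not the answer's own code; the intermediate DataFrame is rebuilt here so the snippet runs standalone):

```python
import pandas as pd

# state after read_csv(header=None, on_bad_lines=...): the header sits in row 0
df = pd.DataFrame([
    ['f in kHz', 'U in mV', 'amplitude in Vpp', 'distance in cm'],
    ['43.7', 52.1, '5', '9'],
    ['43.7', '43', '5', '11'],
])

df.columns = df.iloc[0]  # promote the first row to the header
df = (
    df.iloc[1:]
    .reset_index(drop=True)
    .rename_axis(columns=None)  # drop the leftover columns name
    .astype(float)
)
```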

If your actual data has more complicated bad lines, it may help to first collect them and investigate how best to handle them. E.g.,

bad_rows = []

def handle_bad_lines(row):
    global bad_rows
    bad_rows.append(row)
    return None # returning `None` is effectively skipping the line

# after executing `df = pd.read_csv(...)`
print(bad_rows)
# [['43.7', '52', '1', '5', '9']]

1 Comment

Updated to handle the bad line spotted by @AutumnKome.

In case you don't have access to pandas (like me), you're going to have to write a few subroutines first: one to split each line using regular expressions (regex), and a second to convert values to floats.

import re
import string
from collections import OrderedDict
import sys

# Inputting the filepath via the 1st CL argument here.
# This is a placeholder for however you are passing in your user arguments.
# I would normally pass sys.argv[1:] to the parser module,
# but this is a quick-and-dirty script to verify that it works
fpath = sys.argv[1:][0]

def split_line(line):
    # r'…' is a raw string literal, used here for the regular expression
    return re.sub(r'"(\w+),(\w+)"', r'\1.\2', line).split(',')

def try_make_float(value):
    if isinstance(value, str):
        # string.printable = digits + ascii_letters + punctuation + whitespace.
        # Stripping it isolates any leading/trailing non-printable characters,
        # which are then removed from the value.
        stripchars = value.strip(string.printable)
        value = value.strip(stripchars)
        # Catch NaN and Hex values:
        if value.upper() != value.lower():
            return None # or can return a NaN value instead
        value = float(value)
        return value
    return value

# Initialize ordered dict:
read_data = OrderedDict()
keys = ["f in kHz","U in mV","amplitude in Vpp","distance in cm"]
for key in keys:
    read_data[key] = []

with open(fpath, 'r') as file:
    file.readline() # skip header
    while True:
        line = file.readline().strip()
        if line == '':
            break
        line = split_line(line)
        for value, key in zip(line, keys):
            read_data[key].append(try_make_float(value))
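A quick sanity check of the two helpers (redefined here so the snippet runs standalone):

```python
import re
import string

def split_line(line):
    # same helper as above: un-quote "x,y" fields into x.y, then split on commas
    return re.sub(r'"(\w+),(\w+)"', r'\1.\2', line).split(',')

def try_make_float(value):
    # same helper as above, condensed
    if isinstance(value, str):
        stripchars = value.strip(string.printable)
        value = value.strip(stripchars)
        if value.upper() != value.lower():
            return None  # letters present (e.g. NaN, hex)
        return float(value)
    return value

print(split_line('"43,7",52,1,5,9'))  # ['43.7', '52', '1', '5', '9']
print(try_make_float('43.7'))         # 43.7
print(try_make_float('NaN'))          # None
```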

I also think you may have an error in the first line of the data file, where quotes appear to be missing: the other lines all have 4 columns each.

Should the first line of data be:

"43,7","52,1",5,9

3 Comments

I lost confidence as soon as I saw sys.argv[1:][0]. Also, the requirement is for a properly constructed DataFrame, for which pandas is essential
That is a placeholder for whatever the input method is; I will add that to the comment. Usually I would pass it to a parser. I don't have access to pandas in my dev environment due to restrictions on Python versioning.
This morning I remembered that I would still read into dict lists even with pandas or numpy, since reading directly into lists is much faster. This may not be necessary for smaller datasets. The datasets I use can be tens of millions of rows across multiple columns, so read-in speed is pretty important. After reading in, I would convert the lists to whatever data format is needed, for example a numpy array.
