Pattern replacement in lines read from csv using regex in python 3+

Question

I have to work on malformed CSV text files created by proprietary software that changes some formats (quotation marks, column separator, single-decimal-place floats to four decimal places, and the newline character). My target output is tab-delimited, with Unix newlines and floats with a single decimal place.

Here's some example lines from the original file:

1234\t5678\t-3461\t56\t10\n
4435.5\t-1261\t56\t10\n
89432\t678112\t-2461\t56\t10\n

This is the output of the proprietary software (the floats are not necessarily fixed at four decimal places; the developer might change this in future versions):

"1234.0000 5678.0000 -3461.0000 56.0000 10.0000"\r
"4435.5000 -1261.0000 56.0000 10.0000"\r
"89432.0000 678112.0000 -2461.0000 56.0000 10.0000"\r

My patterns in the function are very verbose. The regex could probably be written more compactly, but as I am not yet very familiar with regex, I tend to keep the patterns simple to understand. Here's the function that I use to restructure each individual line of the CSV file:

import re

def Filter(inputLine):
    line = inputLine.strip().lstrip("'").rstrip("'").lstrip('"').rstrip('"') # strip the newline character and the surrounding quotation marks
    line = re.sub(r'\s','\t', line) # replace each whitespace character with a tab
    line = re.sub(r'\.0{1,9}','', line) # remove every .0* - not really working, .5000 for example - think!
    line = f'{line}\n'
    return(line)

# code for parsing each line of input and so on

So far so good, but as expected, this does not change the 4435.5000 to 4435.5 in line 2:

1234\t5678\t-3461\t56\t10\n
4435.5000\t-1261\t56\t10\n
89432\t678112\t-2461\t56\t10\n

I would like to use regex for this task, if it is efficient even for large (>1GB) text files (I don't know if there is a more elegant solution to handling this operation).

What would the pattern be to maintain the .5, but remove all the .000? I was thinking along the following, but I got stuck with the substitution part:

    line = re.sub(r'\.[1-9]0{1,9}',r'\.[1-9]', line)

Which obviously does not work.

Is there any way to condense the regex pattern? This is more out of interest; as I mentioned above, I currently prefer separate calls to help me understand the regex syntax.

Any suggestions would be more than welcome!

Cheers Sacha

line = re.sub(r'(\.(\d*?))0+', lambda x: x.group(1) if x.group(2) else '', line) (demo)
– Wiktor Stribiżew, Commented Nov 20, 2019 at 11:00
@WiktorStribiżew Wow! This is super efficient and a very cool combination of Python and regex. Thank you very much, this solves my issue.
– Sacha Viquerat, Commented Nov 20, 2019 at 11:13

Wiktor Stribiżew · Accepted Answer · 2019-11-20 11:14:57Z

You may use

line = re.sub(r'(\.(\d*?))0+', lambda x: x.group(1) if x.group(2) else '', line)

See the Python demo.
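
As a minimal sketch, here is the substitution dropped into the question's line filter and applied to one of the sample lines. The names `TRAILING_ZEROS`, `filter_line` and `convert` are illustrative and not from the original code, and `convert` assumes the input looks like the proprietary output shown in the question. It reads one line at a time, so memory use should stay flat even for files larger than 1 GB.

```python
import re

# Hypothetical name: the pattern from this answer, compiled once for reuse.
TRAILING_ZEROS = re.compile(r'(\.(\d*?))0+')

def filter_line(raw_line):
    """Strip quotes and line ending, tab-separate fields, trim zero padding."""
    line = raw_line.strip().strip('"\'')
    line = re.sub(r'\s', '\t', line)
    # Keep ".5" from ".5000"; drop ".0000" entirely.
    line = TRAILING_ZEROS.sub(lambda m: m.group(1) if m.group(2) else '', line)
    return line + '\n'

def convert(src_path, dst_path):
    """Stream src_path to dst_path line by line (paths are assumptions)."""
    # Default text mode treats a lone '\r' as a line ending when reading;
    # newline='\n' on the output keeps Unix newlines on every platform.
    with open(src_path) as src, open(dst_path, 'w', newline='\n') as dst:
        for raw_line in src:
            dst.write(filter_line(raw_line))

print(repr(filter_line('"4435.5000 -1261.0000 56.0000 10.0000"\r')))
# '4435.5\t-1261\t56\t10\n'
```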

It works like this:

(\.(\d*?)) - matches and captures into Group 1 a dot and 0 or more digits, but as few as possible, while capturing these digits into Group 2,
0+ - matches one or more 0 chars
lambda x: x.group(1) if x.group(2) else '' replaces the match with Group 1 contents if Group 2 is not empty, else, the whole match is removed.
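
One caveat worth checking against your data: the pattern has no right-hand boundary, so it relies on the zeros being pure padding, as they are in the question's sample. If a value could ever contain a zero followed by a non-zero digit (say `1.05`), the match would stop after that zero and remove it. A possible stricter variant, sketched here under that assumption, adds a negative lookahead `(?!\d)` so that only zeros at the very end of a number are trimmed. The names `ORIGINAL`, `STRICT` and `trim` are illustrative.

```python
import re

ORIGINAL = re.compile(r'(\.(\d*?))0+')        # pattern from the answer
STRICT = re.compile(r'(\.(\d*?))0+(?!\d)')    # zeros must end the number

def trim(pattern, text):
    # Same replacement logic as above: keep Group 1 only if digits remain.
    return pattern.sub(lambda m: m.group(1) if m.group(2) else '', text)

print(trim(ORIGINAL, '4435.5000'))  # 4435.5
print(trim(STRICT, '4435.5000'))    # 4435.5
print(trim(ORIGINAL, '1.05'))       # 15
print(trim(STRICT, '1.05'))         # 1.05
```

For the four-decimal output shown in the question, both patterns give the same result.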
