
I have to work on malformed CSV text files created by proprietary software that changes several formats: the quotation marks, the column separator, the newline character, and the float precision (floats with a single decimal place are padded to four decimal places). My target output is tab-delimited, with Unix newlines and floats with a single decimal place.

Here are some example lines from the original file:

1234\t5678\t-3461\t56\t10\n
4435.5\t-1261\t56\t10\n
89432\t678112\t-2461\t56\t10\n

This is the output of the proprietary software (the floats are not necessarily fixed at four decimal places; the developer might change this in future versions):

"1234.0000 5678.0000 -3461.0000 56.0000 10.0000"\r
"4435.5000 -1261.0000 56.0000 10.0000"\r
"89432.0000 678112.0000 -2461.0000 56.0000 10.0000"\r

The patterns in my function are verbose. The regex could probably be condensed, but as I am not yet very familiar with regex, I keep the patterns simple so that I can understand them. Here's the function I use to restructure each individual line of the CSV file:

import re

def Filter(inputLine):
    line = inputLine.strip().lstrip("'").rstrip("'").lstrip('"').rstrip('"') # strip the newline character and the surrounding quotation marks
    line = re.sub(r'\s','\t', line) # replace each whitespace character with a tab
    line = re.sub(r'\.0{1,9}','', line) # remove every .0* suffix - not really working for .5000, for example - think!
    line = f'{line}\n'
    return line

# code for parsing each line of input and so on

So far so good, but as expected, this does not change the 4435.5000 to 4435.5 in line 2:

1234\t5678\t-3461\t56\t10\n
4435.5000\t-1261\t56\t10\n
89432\t678112\t-2461\t56\t10\n

I would like to use regex for this task, provided it is efficient even for large (>1GB) text files. I don't know whether there is a more elegant way to handle this operation.

  1. What would the pattern be to maintain the .5, but remove all the .000? I was thinking along the following, but I got stuck with the substitution part:
    line = re.sub(r'\.[1-9]0{1,9}',r'\.[1-9]', line) 

Which obviously does not work.

  2. Is there any way to condense the regex pattern? This is more out of interest; as I mentioned above, I currently prefer separate calls because they help me understand the regex syntax.

Any suggestions would be more than welcome!

Cheers Sacha

  • line = re.sub(r'(\.(\d*?))0+', lambda x: x.group(1) if x.group(2) else '', line) (demo) Commented Nov 20, 2019 at 11:00
  • @WiktorStribiżew Wow! This is super efficient and a very cool combination of python and regex. Thank you very much, this solves my issue. Commented Nov 20, 2019 at 11:13

1 Answer

You may use

line = re.sub(r'(\.(\d*?))0+', lambda x: x.group(1) if x.group(2) else '', line)

See the Python demo.
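As a quick check, here is a minimal sketch that applies the substitution to two of the sample lines from the question, after the whitespace has already been turned into tabs. The wrapper name `strip_trailing_zeros` is only an illustrative choice, not something from the original code.

```python
import re

def strip_trailing_zeros(line):
    # Same substitution as above, wrapped in a function (hypothetical name)
    # so it can be tried on individual lines.
    return re.sub(r'(\.(\d*?))0+',
                  lambda x: x.group(1) if x.group(2) else '',
                  line)

samples = [
    "1234.0000\t5678.0000\t-3461.0000\t56.0000\t10.0000",
    "4435.5000\t-1261.0000\t56.0000\t10.0000",
]

for s in samples:
    # repr() makes the tab characters visible in the printed result
    print(repr(strip_trailing_zeros(s)))
```

The second line should keep its `.5` while every `.0000` disappears.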

It works like this:

  • (\.(\d*?)) - matches and captures into Group 1 a dot and 0 or more digits, but as few as possible, while capturing these digits into Group 2,
  • 0+ - matches one or more 0 chars
  • lambda x: x.group(1) if x.group(2) else '' replaces the match with Group 1 contents if Group 2 is not empty, else, the whole match is removed.
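Note that this pattern fits the data in the question, where at most one significant digit follows the dot. If a value such as `10.0500` or `1.505` could ever appear, the lazy `\d*?` would match nothing or stop too early, and the result would be wrong (`10500` and `1.55`). Below is a hedged sketch of a stricter variant that adds a `(?!\d)` lookahead so the zeros must really end the number. It also folds the three steps of the question's function into one place and streams the file line by line, which should keep memory use flat for files over 1GB. The names `clean_line`, `convert_file` and `TRAILING_ZEROS` are hypothetical, and the file handling assumes the platform's default text encoding is suitable.

```python
import re

# Trailing zeros of a decimal fraction. The lookahead (?!\d) requires that no
# digit follows the zeros, so "10.0500" becomes "10.05" instead of "10500".
TRAILING_ZEROS = re.compile(r'\.(\d*?)0+(?!\d)')

def clean_line(raw):
    """Return one cleaned, tab-delimited line ending in a Unix newline."""
    line = raw.strip().strip('\'"')      # drop newline chars and outer quotes
    line = re.sub(r'\s', '\t', line)     # each whitespace char -> tab, as in the question
    line = TRAILING_ZEROS.sub(
        # keep the dot and the significant digits if there are any,
        # otherwise remove the whole ".000" suffix
        lambda m: '.' + m.group(1) if m.group(1) else '',
        line)
    return line + '\n'

def convert_file(src_path, dst_path):
    # Reading in the default universal-newline mode splits lines on \r, \n
    # and \r\n alike; newline='\n' on the output prevents translation to
    # \r\n on Windows.
    with open(src_path) as src, open(dst_path, 'w', newline='\n') as dst:
        for raw in src:                  # one line at a time, never the whole file
            dst.write(clean_line(raw))
```

Compiling the pattern once avoids repeated cache lookups inside the loop; whether this is measurably faster than separate `re.sub` calls would need to be timed on real data.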