I have to work on malformed CSV text files created by proprietary software that changes several formats (quotation marks, column separator, single-decimal-place floats to four decimal places, and the newline character). My target output is tab-delimited, with Unix newlines and floats with a single decimal place.
Here are some example lines from the original file:
1234\t5678\t-3461\t56\t10\n
4435.5\t-1261\t56\t10\n
89432\t678112\t-2461\t56\t10\n
This is the output of the proprietary software (the floats are not necessarily fixed at four decimal places; the developer might change this in future versions):
"1234.0000 5678.0000 -3461.0000 56.0000 10.0000"\r
"4435.5000 -1261.0000 56.0000 10.0000"\r
"89432.0000 678112.0000 -2461.0000 56.0000 10.0000"\r
The patterns in my function are very verbose. The regex could probably be condensed, but as I am not yet very familiar with regex, I tend to keep the patterns simple to understand. Here's the function that I use to restructure each individual line of the CSV file:
import re

def Filter(inputLine):
    # strip the newline character and the surrounding quotation marks
    line = inputLine.strip().lstrip("'").rstrip("'").lstrip('"').rstrip('"')
    # replace each whitespace character by a tab
    line = re.sub(r'\s', '\t', line)
    # remove every .0* (replace by an empty string) - not really working, .5000 for example - think!
    line = re.sub(r'\.0{1,9}', '', line)
    line = f'{line}\n'
    return line

# code for parsing each line of input and so on
So far so good, but as expected, this does not change the 4435.5000 to 4435.5 in line 2:
1234\t5678\t-3461\t56\t10\n
4435.5000\t-1261\t56\t10\n
89432\t678112\t-2461\t56\t10\n
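To see why the second line is left untouched, here is a small check of the pattern on its own (a sketch only; the sample values are taken from the lines above). `\.0{1,9}` matches only when a zero follows the dot directly, so `.5000` never matches:

```python
import re

# Pattern from Filter(): a literal dot followed directly by one to nine zeros.
pattern = re.compile(r'\.0{1,9}')

samples = ['1234.0000', '4435.5000', '-3461.0000']
results = [pattern.sub('', s) for s in samples]
print(results)  # ['1234', '4435.5000', '-3461']
```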
I would like to use regex for this task, if it is efficient even for large (>1GB) text files (I don't know if there is a more elegant solution to handling this operation).
- What would the pattern be to keep the .5 but remove all the .0000? I was thinking along the following lines, but I got stuck on the substitution part:
line = re.sub(r'\.[1-9]0{1,9}',r'\.[1-9]', line)
Which obviously does not work.
- Is there any way to condense the regex pattern? This is more out of interest; as I mentioned above, I currently prefer separate calls because they help me understand the regex syntax.
Any suggestions would be more than welcome!
Cheers Sacha
line = re.sub(r'(\.(\d*?))0+', lambda x: x.group(1) if x.group(2) else '', line)
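How the pattern above works: group 1 captures the dot plus any digits that should be kept, matched lazily with `\d*?`, and the trailing `0+` is dropped. If no digit was kept, the dot is dropped too. Below is a minimal sketch of how this could fit into the whole conversion, under some assumptions of mine. The names `strip_zeros`, `convert_line` and `convert_file` are hypothetical. The `(?!\d)` lookahead is an addition that is not in the line above; it only matters if a value such as 1.0500 can ever occur, which the original pattern would turn into 1500. Fields are split on runs of whitespace rather than replaced character by character.

```python
import re

# Variant of the pattern above. The lookahead (?!\d) is my addition: it pins
# the zeros to the end of the number, so '1.0500' becomes '1.05'.
TRAILING_ZEROS = re.compile(r'(\.\d*?)0+(?!\d)')

def strip_zeros(text):
    """Drop trailing zeros after a decimal point; drop the dot if nothing is left."""
    def repl(match):
        kept = match.group(1)  # the dot plus any digits worth keeping
        return kept if len(kept) > 1 else ''
    return TRAILING_ZEROS.sub(repl, text)

def convert_line(raw):
    """Turn one quoted, space-separated line into a tab-delimited line ending in \\n."""
    # strip() removes the line terminator; strip('\'"') removes the outer quotes
    fields = raw.strip().strip('\'"').split()
    return '\t'.join(strip_zeros(field) for field in fields) + '\n'

def convert_file(src, dst):
    """Stream src to dst line by line, so memory use stays small for large files."""
    # newline=None lets Python treat a bare \r as a line ending when reading;
    # newline='\n' writes Unix newlines regardless of platform.
    with open(src, newline=None) as fin, open(dst, 'w', newline='\n') as fout:
        for raw in fin:
            fout.write(convert_line(raw))

print(repr(convert_line('"4435.5000 -1261.0000 56.0000 10.0000"\r')))
# '4435.5\t-1261\t56\t10\n'
```

Because the file is read one line at a time, its size should not be a problem for memory. Whether regex is the fastest option for files over 1 GB is something I have not measured.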