I'm working with a .csv file and, as always, it has format problems. In this case it's a semicolon-separated table, but one column (summary) sometimes contains semicolons of its own, like this:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction; animals;2;2
So there are three cases:
- no semicolon -> no problem
- word character (non-numeric), semicolon, one whitespace character, word character (non-numeric)
- word character (non-numeric), semicolon, two whitespace characters, word character (non-numeric)
I turned the .csv into a .txt, read it in as a string, and compiled this regex:
re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
That should do it. I almost managed to replace those semicolons with commas by doing the following:
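import re

def replace_comma(match):
    # Swap the semicolon inside the matched text for a comma.
    text = match.group()
    return text.replace(';', ',')

# A non-digit word character, a semicolon, whitespace, another non-digit word character.
regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)
string2 = string.split('\n')  # string holds the file contents
for n, i in enumerate(string2):
    if regex.search(i):
        string2[n] = regex.sub(replace_comma, i)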
This mostly works, but when there are two whitespace characters after the semicolon, it leaves a \xa0 after the comma. I have two problems with this approach:
- It's not very straightforward
- Why is it leaving this \xa0 character?
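To make the second point reproducible, here is a minimal sketch. It assumes the two whitespace characters in my file are a regular space followed by a non-breaking space (U+00A0), which I have not confirmed; the sample line is made up to match that assumption. In Python 3, \s in a str pattern also matches U+00A0, so the match succeeds and the character is simply carried over.

```python
import re

regex = re.compile(r'([^\d\W]);\s+([^\d\W])')

def replace_comma(match):
    # Only the semicolon changes; any whitespace inside the match is kept as-is.
    return match.group().replace(';', ',')

# Hypothetical line: a regular space followed by a non-breaking space (U+00A0).
line = '3;fishing. Extraction; \xa0animals;2;2'
result = regex.sub(replace_comma, line)
print(repr(result))
```

If that assumption holds, the \xa0 is not added by the substitution; it was already in the data and only becomes visible when the string is shown with repr().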
Do you know any better way to approach this?
Thanks
Edit: My desired output would be:
code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction, animals;2;2
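For reference, a minimal sketch of the transformation I'm after. It is not a definitive solution: it assumes the stray character is a non-breaking space already present in the source, and the names pattern, fix_line and sample are only for illustration. Instead of keeping the matched whitespace, it replaces the semicolon and everything after it up to the next word with a comma and one plain space.

```python
import re

# Lookarounds keep the neighbouring characters out of the match, so only the
# semicolon and the whitespace after it are replaced.
pattern = re.compile(r'(?<=[^\d\W]);\s+(?=[^\d\W])')

def fix_line(line):
    # Replace ";<whitespace>" between two non-digit word characters with ", ".
    return pattern.sub(', ', line)

# Sample rows from above; the last one uses the assumed space + U+00A0 pair.
sample = [
    'code;summary;sector;sub_sector',
    '1;fishes;2;2',
    '2;agriculture; also fishes;1;2',
    '3;fishing. Extraction; \xa0animals;2;2',
]
fixed = [fix_line(line) for line in sample]
print('\n'.join(fixed))
```

Because the delimiter semicolons are never followed by whitespace in my data, they are left untouched; a file where real delimiters can be followed by spaces would need a different rule.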
Edit: Added explanation about turning the file into a string for better manipulation.