Replacing semicolon for comma in csv using regex in python

Question

I'm working with a .csv file and, as always, it has format problems. In this case it's a semicolon-separated table, but one column sometimes contains semicolons itself, like this:

code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2

So there are three cases:

no semicolon -> no problem
word character(non-numeric), semicolon, whitespace, word character(non-numeric)
word character(non-numeric), semicolon, 2xwhitespace, word character(non-numeric)

I turned the .csv into a .txt and then imported it as a string and then I compiled this regex:

re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)

Which should do the job. I almost managed to replace those semicolons with commas by doing the following:

import re

def replace_comma(match):
    # Replace the semicolon inside the matched text, keeping everything else
    text = match.group()
    return text.replace(';', ',')

regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)

# `string` holds the contents of the file
string2 = string.split('\n')

for n, i in enumerate(string2):
    if len(re.findall(r'([^\d\W]);(\s+)([^\d\W])', i)) >= 1:
        string2[n] = regex.sub(replace_comma, i)

This mostly works, but when there are two whitespace characters after the semicolon, it leaves an \xa0 after the comma. I have two problems with this approach:

It's not very straightforward
Why is it leaving this \xa0 character ?

Do you know any better way to approach this?

Thanks

Edit: My desired output would be:

code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction,  animals;2;2

Edit: Added explanation about turning the file into a string for better manipulation.
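
On the \xa0 question: a likely explanation, assuming the file itself contains non-breaking spaces (U+00A0), is that the character was already in the data. In Python 3, \s matches U+00A0 for str patterns, and replace_comma only swaps the semicolon, so the \xa0 after it stays put. Below is a minimal sketch that normalises those characters first and then replaces the semicolon with a single re.sub, using lookarounds instead of groups. The names PATTERN and fix_line are hypothetical, and turning \xa0 into a plain space is an assumption about what is wanted.

```python
import re

# A ';' preceded by a non-digit word character and followed by
# whitespace plus another non-digit word character.
PATTERN = re.compile(r'(?<=[^\d\W]);(?=\s+[^\d\W])')

def fix_line(line):
    # Assumption: non-breaking spaces are unwanted, so make them plain spaces.
    line = line.replace('\xa0', ' ')
    # Only the semicolon itself is matched, so only it is replaced.
    return PATTERN.sub(',', line)

# \s matches U+00A0 in Python 3 str patterns.
print(re.fullmatch(r'\s', '\xa0') is not None)
print(fix_line('3;fishing. Extraction; \xa0animals;2;2'))
```

Because the lookarounds consume nothing but the semicolon, the surrounding characters and whitespace are left exactly as they were.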

Is there just one specific "column" where the undesirable semicolons appear (i.e. column 2 in your example)? Or, can this issue occur in different "columns" for each row?
– benvc, Commented Jul 9, 2019 at 21:14

Andrej Kesely · Accepted Answer · 2019-07-09 21:32:54Z

For this case I wouldn't use regex; split() and rsplit() with the maxsplit= parameter are enough:

data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2'''

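for line in data.splitlines():
    # Split off the first column (code) from the left.
    row = line.split(';', maxsplit=1)
    # Split off the last two columns (sector, sub_sector) from the right.
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    # Whatever is left in the middle is the summary; fix its semicolons.
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))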

Prints:

1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction,  animals;2;2
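
The same idea can be sketched more generally, assuming only one free-text column ever contains stray separators and the expected number of columns is known. The function name merge_extra_fields and its parameters (text_col, n_cols, sep, repl) are hypothetical and not part of the answer above. It counts the surplus fields in a row and joins them back into the free-text column.

```python
def merge_extra_fields(line, text_col=1, n_cols=4, sep=';', repl=','):
    """Rejoin a row whose free-text column contains stray separators."""
    parts = line.split(sep)
    extra = len(parts) - n_cols
    if extra <= 0:
        # Already well formed (or too short to repair): leave untouched.
        return line
    # Fields text_col .. text_col + extra all belong to the free-text column.
    merged = repl.join(parts[text_col:text_col + extra + 1])
    fixed = parts[:text_col] + [merged] + parts[text_col + extra + 1:]
    return sep.join(fixed)

data = '''code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2'''

for line in data.splitlines():
    print(merge_extra_fields(line))
```

Unlike the regex, this does not depend on what surrounds the semicolon, but it does rely on every other column being free of the separator.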

edited Jul 9, 2019 at 21:32

answered Jul 9, 2019 at 21:15

Andrej Kesely
