I'm working with a .csv file and, as always, it has format problems. In this case it's a semicolon-separated table, but one column sometimes contains semicolons itself, like this:

code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2

So there are three cases:

  • no semicolon -> no problem
  • word character(non-numeric), semicolon, whitespace, word character(non-numeric)
  • word character(non-numeric), semicolon, 2xwhitespace, word character(non-numeric)

I converted the .csv into a .txt, read it in as a string, and compiled this regex:

re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)

which should do the job. I almost managed to replace those semicolons with commas by doing the following:

import re

def replace_comma(match):
    # Swap the semicolon for a comma; the matched whitespace is left untouched.
    text = match.group()
    return text.replace(';', ',')

regex = re.compile(r'([^\d\W]);\s+([^\d\W])', re.S)

# `string` holds the full contents of the file.
string2 = string.split('\n')

for n, i in enumerate(string2):
    if regex.search(i):
        string2[n] = regex.sub(replace_comma, i)

This mostly works, but when there are two whitespace characters after the semicolon, it leaves a \xa0 after the comma. I have two problems with this approach:

  • It's not very straightforward
  • Why is it leaving this \xa0 character?
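
On the second point, one likely explanation (an assumption, since the raw file isn't shown) is that the second "space" in the data is already a non-breaking space, U+00A0. In Python 3, \s in a str pattern matches that character, so the regex still matches, but replace_comma only swaps the semicolon and leaves the whitespace where it was. A minimal sketch with a made-up sample line:

```python
import re

# Hypothetical sample: the second "space" is a non-breaking space (U+00A0).
line = "3;fishing. Extraction; \xa0animals;2;2"

pattern = re.compile(r"([^\d\W]);\s+([^\d\W])")

def replace_comma(match):
    # Only the semicolon changes; the matched whitespace is kept as-is.
    return match.group().replace(";", ",")

fixed = pattern.sub(replace_comma, line)
print(repr(fixed))

# Optionally normalise non-breaking spaces to ordinary spaces afterwards.
normalised = fixed.replace("\xa0", " ")
print(repr(normalised))
```

If that is what is happening, the \xa0 is not introduced by the substitution; it was in the input all along and only becomes visible when the string is shown with repr().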

Do you know any better way to approach this?

Thanks

Edit: My desired output would be:

code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction,  animals;2;2

Edit: Added explanation about turning the file into a string for better manipulation.

  • Is there just one specific "column" where the undesirable semicolons appear (i.e. column 2 in your example)? Or, can this issue occur in different "columns" for each row? Commented Jul 9, 2019 at 21:14
  • Just one column (the second one). I should add column names Commented Jul 9, 2019 at 21:15

1 Answer

For this case I wouldn't use regex; split() and rsplit() with the maxsplit= parameter are enough:

data = '''1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2'''

for line in data.splitlines():
    row = line.split(';', maxsplit=1)
    row = row[:1] + row[-1].rsplit(';', maxsplit=2)
    row[1] = row[1].replace(';', ',')
    print(';'.join(row))

Prints:

1;fishes;2;2
2;agriculture, also fishes;1;2
3;fishing. Extraction,  animals;2;2
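
The same idea can be wrapped in a small helper so it is easy to apply to every line of the file. This is only a sketch: the name fix_line and its sep/repl parameters are made up for illustration, and it assumes the layout from the question (exactly one column before the free-text field and two after it), so a line with fewer than three separators would need separate handling.

```python
def fix_line(line, sep=";", repl=","):
    """Replace separators that appear inside the second column only.

    Assumes one column before and two columns after the free-text field.
    """
    # Peel off the first column from the left...
    head, rest = line.split(sep, maxsplit=1)
    # ...and the last two columns from the right; what remains is the text field.
    text, *tail = rest.rsplit(sep, maxsplit=2)
    return sep.join([head, text.replace(sep, repl), *tail])


data = """code;summary;sector;sub_sector
1;fishes;2;2
2;agriculture; also fishes;1;2
3;fishing. Extraction;  animals;2;2"""

# Blank lines are skipped because they contain no separator to split on.
fixed = [fix_line(line) for line in data.splitlines() if line.strip()]
print("\n".join(fixed))
```

The header row passes through unchanged because it has exactly three separators, so there is nothing left over in the second field to replace.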