REGEX for complex strings

Question

I have the lines from a CSV file:

315,"Misérables, Les (1995)",Drama|War
315,Big Bully (1996),Comedy|Drama

I want to split each line into a list of 3 elements, so I need a general REGEX that splits wherever it encounters ','. Since the title may itself contain a comma (as in the first line), the split has to skip over the title. A title that contains commas is also wrapped in quotation marks, but I need the expression to work for both cases. Is it possible to do this with REGEX?

I'm trying to learn REGEX by myself and I'm having difficulty understanding some cases. I would really appreciate your help!

By parentheses, do you mean quotation marks ("")? If so, you can first try splitting the string on them: if you get 3 items, you are done; if not, split by comma.
– Mahrkeenerh, Commented Nov 2, 2021 at 8:28
Python already has a module for CSV parsing, don't try reinventing it with regex.
– jonrsharpe, Commented Nov 2, 2021 at 8:31
It's not that I'm trying to reinvent it. I'm currently trying to define MapReduce jobs, and for that I must use REGEX in order to keep the operations as simple as possible. I don't read the file from the main script; instead I feed it to STDIN on the terminal while running the Python script.
– Xhuliano Tatazi, Commented Nov 2, 2021 at 8:36
Why do you redirect the output? Sounds like loading it from a file with extra steps.
– Mahrkeenerh, Commented Nov 2, 2021 at 8:37

Mahrkeenerh · Accepted Answer · 2021-11-02 08:32:20Z

2

If you're trying to parse a .csv file, don't do it by hand. Python already has libraries that will do it for you, such as the built-in csv module.

Otherwise, if your string has quotation marks whenever there is a comma inside the title, and doesn't have them when there is not, you can do it like this:

>>> x = '315,"Misérables, Les (1995)",Drama|War'
>>> y = '315,Big Bully (1996),Comedy|Drama'
>>> x
'315,"Misérables, Les (1995)",Drama|War'
>>> y
'315,Big Bully (1996),Comedy|Drama'

>>> x.split('"') if len(x.split('"')) == 3 else x.split(',')
['315,', 'Misérables, Les (1995)', ',Drama|War']
>>> y.split('"') if len(y.split('"')) == 3 else y.split(',')
['315', 'Big Bully (1996)', 'Comedy|Drama']

When the string is split on the quotation marks, this leaves a comma at the end of the first part and at the start of the last part, so you will have to remove those manually afterwards.
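The cleanup step could be sketched roughly as follows. The helper name split_record is made up for illustration, and the sketch assumes exactly three columns where only the title may be quoted:

```python
def split_record(line):
    # split_record is a hypothetical helper, not part of any library.
    parts = line.split('"')
    if len(parts) == 3:
        # Quoted title: remove the delimiter comma left at the end of the
        # first part and at the start of the last part.
        return [parts[0].rstrip(','), parts[1], parts[2].lstrip(',')]
    # No quotation marks: a plain comma split is enough.
    return line.split(',')


print(split_record('315,"Misérables, Les (1995)",Drama|War'))
# ['315', 'Misérables, Les (1995)', 'Drama|War']
print(split_record('315,Big Bully (1996),Comedy|Drama'))
# ['315', 'Big Bully (1996)', 'Comedy|Drama']
```

Note that rstrip(',') and lstrip(',') remove every comma at that edge, so this would misbehave if the first or last column could be empty.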


3 Comments

Xhuliano Tatazi Over a year ago

On the first line the commas still stay inside the split elements. I need to avoid that so the output has the same form as for the second line.

Mahrkeenerh Over a year ago

That's why I said that it keeps them there, so you need to clean it up afterwards. If you only have 3 columns, it's easy: just remove the commas from the first and last part. I just wasn't sure whether you could have more columns, with more titles, so I didn't want to write an unnecessarily generic solution for a problem that doesn't require it.

Xhuliano Tatazi Over a year ago

Yes, actually I managed to make it work by reading the data from code instead of feeding it through STDIN. Thank you, it was really helpful!
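For readers who really do need a single regular expression, as the question asks, one possible sketch (not taken from either answer) is to split only on commas that are followed by an even number of quotation marks, which means the comma sits outside any quoted field. The names FIELD_SEP and split_line are made up here, and the sketch assumes fields never contain escaped quotes ("") or embedded newlines:

```python
import re

# Split on a comma only if an even number of double quotes follows it,
# i.e. the comma is not inside a quoted field.
FIELD_SEP = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_line(line):
    # Drop the trailing newline, split, then remove the surrounding quotes.
    fields = FIELD_SEP.split(line.rstrip('\n'))
    return [field.strip('"') for field in fields]


print(split_line('315,"Misérables, Les (1995)",Drama|War'))
# ['315', 'Misérables, Les (1995)', 'Drama|War']
print(split_line('315,Big Bully (1996),Comedy|Drama'))
# ['315', 'Big Bully (1996)', 'Comedy|Drama']
```

Unlike the three-item check above, this does not depend on the number of columns, but it is still not a full CSV parser.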

Batuhan Atalay · Accepted Answer · 2021-11-02 08:33:14Z

0

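Actually, you do not need REGEX for this problem. The quoting support in the csv module will handle it.

For example (csv_input_file is an already opened file object):

import csv
filereader = csv.reader(csv_input_file, delimiter=',', quotechar='"')

Give it a try and see whether it solves your problem.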
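Since the question mentions feeding the data through STDIN, here is a minimal sketch of how the reader could be used in that setting. csv.reader accepts any iterable of lines, so sys.stdin can be passed directly; the function name parse_rows is an assumption made for illustration:

```python
import csv


def parse_rows(lines):
    # parse_rows is a hypothetical helper. `lines` can be any iterable of
    # strings: an open file, sys.stdin, or a plain list as used below.
    return list(csv.reader(lines, delimiter=',', quotechar='"'))


sample = [
    '315,"Misérables, Les (1995)",Drama|War\n',
    '315,Big Bully (1996),Comedy|Drama\n',
]
for row in parse_rows(sample):
    print(row)
# ['315', 'Misérables, Les (1995)', 'Drama|War']
# ['315', 'Big Bully (1996)', 'Comedy|Drama']
```

In a script that reads from the terminal, the call would presumably become parse_rows(sys.stdin) after an import sys.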

