
I have the lines from a CSV file:

  1. 315,"Misérables, Les (1995)",Drama|War

  2. 315,Big Bully (1996),Comedy|Drama

I want to split each line into a list of 3 elements, so I need a general REGEX that splits wherever it encounters ','. Since the title may itself contain a comma (as in the first line), the expression must not split inside the title. A title that contains commas is always wrapped in quotation marks, but I need the expression to work for both cases. Is it possible to do this with REGEX?

I'm trying to learn REGEX by myself and I'm having difficulty understanding some cases. I would really appreciate your help!

  • By parentheses, do you mean quotation marks ("")? If so, you can first try to split the string on them: if you get 3 items, you are done; if not, split on commas. Commented Nov 2, 2021 at 8:28
  • Python already has a module for CSV parsing; don't try reinventing it with regex. Commented Nov 2, 2021 at 8:31
  • This may help: stackoverflow.com/questions/54546368/… Commented Nov 2, 2021 at 8:34
  • It's not that I'm trying to reinvent it. I'm currently trying to define MapReduce jobs, and for that I must use REGEX in order to keep the operations as simple as possible. I don't read the file from the main; instead I feed it to STDIN on the terminal while running the Python script. Commented Nov 2, 2021 at 8:36
  • Why do you redirect the output? Sounds like loading it from a file with extra steps. Commented Nov 2, 2021 at 8:37

2 Answers

If you're trying to parse a .csv file, don't do it by hand; Python already has libraries that will do it for you, including the csv module in the standard library.

Otherwise, if your string has quotation marks only when there is a comma inside the title, you can do it like this:

>>> x = '315,"Misérables, Les (1995)",Drama|War'
>>> y = '315,Big Bully (1996),Comedy|Drama'
>>> x
'315,"Misérables, Les (1995)",Drama|War'
>>> y
'315,Big Bully (1996),Comedy|Drama'

>>> x.split('"') if len(x.split('"')) == 3 else x.split(',')
['315,', 'Misérables, Les (1995)', ',Drama|War']
>>> y.split('"') if len(y.split('"')) == 3 else y.split(',')
['315', 'Big Bully (1996)', 'Comedy|Drama']

When the line is split on a quotation mark, this leaves a comma at the end of the first part and at the start of the last part, so you will have to remove them manually afterwards.
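The cleanup step could look something like the sketch below. The function name `split_record` is made up for illustration, and it assumes exactly three columns with at most one quoted field (the title), as in the two sample lines.

```python
def split_record(line):
    """Split a 3-column line whose title may be wrapped in double quotes."""
    line = line.rstrip('\n')  # lines read from STDIN end with a newline
    parts = line.split('"')
    if len(parts) == 3:
        # Quoted title: drop the comma left at the end of the first part
        # and at the start of the last part.
        return [parts[0].rstrip(','), parts[1], parts[2].lstrip(',')]
    # No quotes: a plain comma split is enough.
    return line.split(',')


print(split_record('315,"Misérables, Les (1995)",Drama|War'))
print(split_record('315,Big Bully (1996),Comedy|Drama'))
```

Both calls return a list of three strings with no stray commas. This will not cope with more than one quoted field per line.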


3 Comments

Still, on the first line the commas stay inside the split elements. I have to avoid that so the output has the same shape as for the second line.
That's why I said that it keeps them there, so you need to clean it up afterwards. If you only have 3 columns, it's easy: just remove the commas from the first and last part. I wasn't sure whether you could have more columns, with more titles, so I didn't want to write an unnecessarily generic solution for a problem that doesn't require it.
Yes, actually I managed to make it work by reading the file from code instead of feeding it to STDIN. Thank you, it was really helpful!
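For the more generic case raised in the comments (more columns, possibly several quoted fields), one regex-only option is to split on a comma only when it is followed by an even number of double quotes, which means the comma sits outside any quoted field. This is a sketch, not a full CSV parser: the names `FIELD_SEP` and `split_with_regex` are made up for illustration, and it assumes quotes are balanced and does not unescape doubled quotes inside a field.

```python
import re

# Match a comma only if the rest of the line contains an even number of
# double quotes, i.e. the comma is not inside a quoted field.
FIELD_SEP = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_with_regex(line):
    """Split a CSV-like line on commas that are outside double quotes."""
    fields = FIELD_SEP.split(line.rstrip('\n'))
    # Remove the surrounding quotes from quoted fields.
    return [
        f[1:-1] if len(f) >= 2 and f[0] == '"' and f[-1] == '"' else f
        for f in fields
    ]


print(split_with_regex('315,"Misérables, Les (1995)",Drama|War'))
print(split_with_regex('315,Big Bully (1996),Comedy|Drama'))
```

The lookahead rescans the rest of the line for every comma, so on very long lines the csv module is likely the better choice.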

Actually, you do not need REGEX for this problem. The quoting support in Python's csv module will solve it.

For example:

import csv
filereader = csv.reader(csv_input_file, delimiter=',', quotechar='"')

Give it a try. Here csv_input_file is an open file object; csv.reader accepts any iterable of lines, so sys.stdin works too.
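A slightly fuller sketch of the same idea, in case it helps. The helper name `parse_lines` and the in-memory sample are assumptions made for illustration; in a mapper that reads from STDIN you would pass `sys.stdin` instead of the `io.StringIO` object.

```python
import csv
import io


def parse_lines(lines):
    """Parse an iterable of CSV lines into lists of fields."""
    # csv.reader handles the quoted title and strips the quotes for us.
    return list(csv.reader(lines, delimiter=',', quotechar='"'))


# Stand-in for sys.stdin so the example runs on its own.
sample = io.StringIO(
    '315,"Misérables, Les (1995)",Drama|War\n'
    '315,Big Bully (1996),Comedy|Drama\n'
)

for movie_id, title, genres in parse_lines(sample):
    print(movie_id, title, genres, sep='\t')
```

Each row comes back as a list of three strings, with the comma inside the quoted title preserved.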
