Split string by multiple separators? [duplicate]

Question

Possible Duplicate:
Python: Split string with multiple delimiters

Can I do something similar in Python?

Split method in VB.net:

Dim line As String = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
Dim separators() As String = {"Tech ID:", "Name:", "Account #:"}
Dim result() As String
result = line.Split(separators, StringSplitOptions.RemoveEmptyEntries)

Li-aung Yip · Accepted Answer · 2012-05-03 06:03:50Z

2

Given a bad data format like this, you could try re.split():

>>> import re
>>> mystring = "Field 1: Data 1 Field 2: Data 2 Field 3: Data 3"
>>> a = re.split(r"(Field 1:|Field 2:|Field 3:)", mystring)
>>> a
['', 'Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']

Your job would be much easier if the data were sanely formatted, with quoted strings and comma-separated records. You could then parse it with the csv module, which reads comma-separated value files.

Edit:

You can filter out the blank entries with a list comprehension.

>>> a_non_empty = [s for s in a if s]
>>> a_non_empty
['Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']
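
Because the pattern is wrapped in parentheses, re.split() keeps the separators in the result, so labels and values alternate. A minimal sketch of pairing them up into a dictionary, assuming the string starts with a label and that the trailing colon should be dropped from each key (the name records is only illustrative):

```python
import re

mystring = "Field 1: Data 1 Field 2: Data 2 Field 3: Data 3"

# The capturing group (...) makes re.split() return the separators too.
parts = re.split(r"(Field 1:|Field 2:|Field 3:)", mystring)

# parts[0] is the text before the first separator (empty here).
# Separators sit at odd indexes, the text following each at even indexes.
records = {
    label.rstrip(":"): value.strip()
    for label, value in zip(parts[1::2], parts[2::2])
}
print(records)  # {'Field 1': 'Data 1', 'Field 2': 'Data 2', 'Field 3': 'Data 3'}
```

Slicing by position rather than filtering out empty strings keeps labels and values aligned even if one field has no data.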

answered May 3, 2012 at 6:03

Li-aung Yip

2 Comments

fpena06 Over a year ago

Thanks for that! I know about the data format. Unfortunately it's a PITA pdf to csv conversion I'm trying to make.

fpena06 Over a year ago

Could you elaborate a bit? I'm very new and don't understand your code.

codaddict · Accepted Answer · 2012-05-03 06:05:10Z

1

>>> import re
>>> str = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> re.split("Tech ID:|Name:|Account #:",str)
['', ' xxxxxxxxxx ', ' DOE, JOHN ', ' xxxxxxxx']
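
To get closer to the VB.NET call in the question, the separators can be kept in a list and the empty pieces dropped afterwards. This is a rough sketch, not an exact equivalent: it also strips surrounding whitespace, which StringSplitOptions.RemoveEmptyEntries does not do. The variable names mirror the question's VB code.

```python
import re

line = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
separators = ["Tech ID:", "Name:", "Account #:"]

# re.escape() protects separators that contain regex metacharacters.
pattern = "|".join(re.escape(sep) for sep in separators)

# Strip whitespace from each piece and skip the pieces that end up empty.
result = [piece.strip() for piece in re.split(pattern, line) if piece.strip()]
print(result)  # ['xxxxxxxxxx', 'DOE, JOHN', 'xxxxxxxx']
```

If one separator is a prefix of another, list the longer one first, because the alternation tries the options from left to right.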

answered May 3, 2012 at 6:05

codaddict

7 Comments

Li-aung Yip Over a year ago

Why do the split tokens themselves not appear in your output? Python 2 vs. Python 3 difference?

fpena06 Over a year ago

That's a good question. I didn't catch that.

codaddict Over a year ago

@Li-aungYip: :) Not really, it has nothing to do with the Python version. I did not enclose the pattern in (...), so the separators were not captured.

jamylak Over a year ago

Just one small thing: you may not want to call your variable str, since it is the name of a builtin.

Li-aung Yip Over a year ago

Ah, I'm silly. I didn't realise that the split pattern would actually allow capturing.

Tim Pietzcker · Accepted Answer · 2012-05-03 06:38:04Z

0

I would suggest a different approach:

>>> import re
>>> subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> regex = re.compile(r"(Tech ID|Name|Account #):\s*(.*?)\s*(?=Tech ID:|Name:|Account #:|$)")
>>> dict(regex.findall(subject))
{'Tech ID': 'xxxxxxxxxx', 'Name': 'DOE, JOHN', 'Account #': 'xxxxxxxx'}

That way you get a useful data structure for this kind of data: a dictionary.

As a commented regex:

regex = re.compile(
    r"""(?x)                         # Verbose regex:
    (Tech\ ID|Name|Account\ \#)      # Match identifier
    :                                # Match a colon
    \s*                              # Match optional whitespace
    (.*?)                            # Match any number of characters, as few as possible
    \s*                              # Match optional whitespace
    (?=                              # Assert that the following can be matched:
     Tech\ ID:|Name:|Account\ \#:    # The next identifier
     |$                              # or the end of the string
    )                                # End of lookahead assertion""")
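
If typing the identifiers twice is a concern, the pattern can be assembled from a single list. This is only a sketch of that idea, assuming the identifiers are known in advance; the names identifiers, names and fields are illustrative.

```python
import re

subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
identifiers = ["Tech ID", "Name", "Account #"]

# Build the alternation once; re.escape() guards any regex metacharacters.
names = "|".join(re.escape(name) for name in identifiers)

# The same alternation is used for the captured key and inside the lookahead.
# The lookahead uses a non-capturing group so findall() still yields pairs.
regex = re.compile(r"({0}):\s*(.*?)\s*(?=(?:{0}):|$)".format(names))

fields = dict(regex.findall(subject))
print(fields)  # {'Tech ID': 'xxxxxxxxxx', 'Name': 'DOE, JOHN', 'Account #': 'xxxxxxxx'}
```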

edited May 3, 2012 at 6:38

answered May 3, 2012 at 6:32

Tim Pietzcker

2 Comments

jamylak Over a year ago

This doesn't seem like a good approach to me since you are repeating the identifiers.

Tim Pietzcker Over a year ago

@jamylak: I know, but how else would you be able to tell when the value has ended? It would be much better, of course, if you could preserve the delimiters, but that doesn't seem to be an option.
