4

Possible Duplicate:
Python: Split string with multiple delimiters

Can I do something similar in Python?

Split method in VB.net:

Dim line As String = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
Dim separators() As String = {"Tech ID:", "Name:", "Account #:"}
Dim result() As String
result = line.Split(separators, StringSplitOptions.RemoveEmptyEntries)
0

3 Answers

2

Given a bad data format like this, you could try re.split():

>>> import re
>>> mystring = "Field 1: Data 1 Field 2: Data 2 Field 3: Data 3"
>>> a = re.split(r"(Field 1:|Field 2:|Field 3:)", mystring)
>>> a
['', 'Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']

Your job would be much easier if the data were sanely formatted, with quoted strings and comma-separated records. You could then parse it with the csv module, which is built for comma-separated value files.
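To illustrate what that would look like, here is a minimal sketch using the csv module. The input string is a hypothetical, well-formed version of the record (it is not the asker's actual data), with the name quoted because it contains a comma.

```python
import csv
import io

# Hypothetical CSV version of the record: a header row, then one data row.
# The name is quoted so that its embedded comma is not treated as a delimiter.
data = 'Tech ID,Name,Account #\nxxxxxxxxxx,"DOE, JOHN",xxxxxxxx\n'

# DictReader uses the first row as field names and yields one mapping per record.
rows = list(csv.DictReader(io.StringIO(data)))

print(rows[0]["Name"])       # DOE, JOHN
print(rows[0]["Account #"])  # xxxxxxxx
```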

Edit:

You can filter out the blank entries with a list comprehension.

>>> a_non_empty = [s for s in a if s]
>>> a_non_empty
['Field 1:', ' Data 1 ', 'Field 2:', ' Data 2 ', 'Field 3:', ' Data 3']
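Putting the pieces together, a rough Python counterpart of the VB.NET snippet in the question might look like the sketch below. It assumes you want the delimiters dropped (so no capturing group) and builds the pattern from a list with re.escape so that characters such as # are matched literally. The strip() call is an extra that VB.NET's RemoveEmptyEntries does not do by itself; drop it if you need the surrounding whitespace.

```python
import re

line = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
separators = ["Tech ID:", "Name:", "Account #:"]

# Escape each separator so it is matched literally, then join them into one alternation.
pattern = "|".join(re.escape(sep) for sep in separators)

# Without a capturing group, re.split() discards the separators themselves.
parts = re.split(pattern, line)

# Drop empty entries (like RemoveEmptyEntries) and trim surrounding whitespace.
result = [part.strip() for part in parts if part.strip()]

print(result)  # ['xxxxxxxxxx', 'DOE, JOHN', 'xxxxxxxx']
```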

2 Comments

Thanks for that! I know about the data format. Unfortunately it's a PITA pdf to csv conversion I'm trying to make.
Could you elaborate a bit? I'm very new and don't understand your code.
1
>>> import re
>>> str = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> re.split("Tech ID:|Name:|Account #:",str)
['', ' xxxxxxxxxx ', ' DOE, JOHN ', ' xxxxxxxx']

7 Comments

Why do the split tokens themselves not appear in your output? Python 2 vs. Python 3 difference?
That's a good question. I didn't catch that.
@Li-aungYip: :) Not really, it has nothing to do with the Python version. I just did not enclose the pattern in (...), so the delimiters did not get captured.
Just one small thing: you may not want to call your variable str, since it is the name of a builtin.
Ah, I'm silly. I didn't realise that the split pattern would actually allow capturing.
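For anyone comparing the two answers, this small sketch shows the difference side by side: the same alternation passed to re.split() with and without a capturing group. The variable names are just for illustration.

```python
import re

text = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"

# No capturing group: the delimiters are consumed and left out of the result.
without_group = re.split(r"Tech ID:|Name:|Account #:", text)

# Capturing group: each matched delimiter is also returned as a list item.
with_group = re.split(r"(Tech ID:|Name:|Account #:)", text)

print(without_group)
# ['', ' xxxxxxxxxx ', ' DOE, JOHN ', ' xxxxxxxx']
print(with_group)
# ['', 'Tech ID:', ' xxxxxxxxxx ', 'Name:', ' DOE, JOHN ', 'Account #:', ' xxxxxxxx']
```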
0

I would suggest a different approach:

>>> import re
>>> subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"
>>> regex = re.compile(r"(Tech ID|Name|Account #):\s*(.*?)\s*(?=Tech ID:|Name:|Account #:|$)")
>>> dict(regex.findall(subject))
{'Tech ID': 'xxxxxxxxxx', 'Name': 'DOE, JOHN', 'Account #': 'xxxxxxxx'}

That way you get a useful data structure for this kind of data: a dictionary.

As a commented regex:

regex = re.compile(
    r"""(?x)                         # Verbose regex:
    (Tech\ ID|Name|Account\ \#)      # Match identifier
    :                                # Match a colon
    \s*                              # Match optional whitespace
    (.*?)                            # Match any number of characters, as few as possible
    \s*                              # Match optional whitespace
    (?=                              # Assert that the following can be matched:
     Tech\ ID:|Name:|Account\ \#:    # The next identifier
     |$                              # or the end of the string
    )                                # End of lookahead assertion""")

2 Comments

This doesn't seem like a good approach to me since you are repeating the identifiers.
@jamylak: I know, but how else would you be able to tell when the value has ended? It would of course be much better if you could preserve the delimiters, but that doesn't seem to be an option.
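One possible way to address the repetition concern is to build the pattern from a single list of field names, so each identifier is written only once in the source. This is a sketch, not part of the original answer; the names fields, names and record are made up for the example, and the regex is otherwise the same as the one above.

```python
import re

# The identifiers are listed once; the regex is generated from this list.
fields = ["Tech ID", "Name", "Account #"]
subject = "Tech ID: xxxxxxxxxx Name: DOE, JOHN Account #: xxxxxxxx"

# Escape each name so characters like '#' are matched literally.
names = "|".join(re.escape(field) for field in fields)

# Group 1 is the identifier, group 2 the value. The lookahead uses a
# non-capturing group, so findall() still returns (identifier, value) pairs.
regex = re.compile(r"({0}):\s*(.*?)\s*(?=(?:{0}):|$)".format(names))

record = dict(regex.findall(subject))
print(record)
# {'Tech ID': 'xxxxxxxxxx', 'Name': 'DOE, JOHN', 'Account #': 'xxxxxxxx'}
```

The identifiers still appear twice in the compiled pattern, but adding a new field now means editing only the list.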
