Parsing in Python where delimiter also appears in the data

Question

Wow, I'm thankful for all of the responses on this! To clarify, the data pattern does repeat. Here is a sample:

Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm 
 other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm  other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm 
 and so on and so on

I am using Python 3.7 to parse input from a text file that is formatted like this sample:

Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm

and the pattern repeats, with other similar fields, through a few hundred pages.

Because there is a ":" character in some of the values (e.g. hh:mm), I am not sure how to use it as a delimiter between the key and the value. I need to obtain all of the values associated with "Item", "Name", and "Time left" and output all of the matching values to a CSV file (I have the output part working).

Any suggestions? Thank you!

(apologies, I asked this on Stack Exchange and it was deleted, I'm new at this)

It might be possible to use ': ' (with the space) as the delimiter.
– Leafar, Commented Aug 20, 2019 at 17:48
Is that the correct format for the data, exactly? If so, I don't see actual uses of delimiters. This is more extracting text and a regex issue. Delimited data would have a colon after some text, too.
– T.Woody, Commented Aug 20, 2019 at 17:53
Also, if this code has any sort of value and is production based, you will want to speak with the creator of that file and have it properly formatted. That is the correct response here.
– T.Woody, Commented Aug 20, 2019 at 18:30
Can you provide a couple of other examples for such data? Also, do you convert each line to a CSV row?
– zmbq, Commented Aug 20, 2019 at 18:55
Hi, thanks for everyone's reply! Here is some more sample data. The pattern does repeat, and I need to extract every pair in the file that matches the three that I'm looking for: Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm and on and on
– cb99, Commented Aug 20, 2019 at 20:40

javidcf · Accepted Answer · 2019-08-20 17:55:01Z

2

You can use a regular expression.

import re

rgx = re.compile(r'^Item: (.*) Name: (.*) Time recorded: (.*) Time left: (.*)$')
data = 'Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm'
item, name, time_recorded, time_left = rgx.match(data).groups()
print(item, name, time_recorded, time_left, sep='\n')
# some text
# some other text
# hh:mm
# hh:mm


2 Comments

dhanlin Over a year ago

This will fail if the pattern repeats, and the question clearly says the pattern repeats.

javidcf Over a year ago

@dhanlin Yes, I'm not sure if OP means there is one per line, in which case this could work, or one after another.
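
On the point raised in these comments: if the records run one after another rather than one per line, a pattern anchored with ^ and $ no longer fits. Here is a minimal sketch of one way to handle that case with re.finditer and unanchored, non-greedy groups. It assumes both time values are single whitespace-free tokens such as hh:mm and that no value spans a line break; the name extract_records and the sample values are made up for illustration.

```python
import re

# Assumption: each time value is one whitespace-free token (e.g. "10:15"),
# so \S+ stops before any unrelated text that follows "Time left".
RECORD = re.compile(
    r"Item:\s+(?P<item>.*?)\s+"
    r"Name:\s+(?P<name>.*?)\s+"
    r"Time recorded:\s+(?P<recorded>\S+)\s+"
    r"Time left:\s+(?P<left>\S+)"
)


def extract_records(text):
    """Return (item, name, time_left) for every record found in text."""
    return [
        (m.group("item"), m.group("name"), m.group("left"))
        for m in RECORD.finditer(text)
    ]


# Illustrative sample: two records separated by unrelated text.
sample = (
    "Item: apple Name: Alice Time recorded: 10:15 Time left: 00:45 \n"
    " other unrelated text Item: pear Name: Bob Smith "
    "Time recorded: 11:00 Time left: 01:30 \n and so on"
)
for row in extract_records(sample):
    print(row)
```

The resulting tuples can be handed to csv.writer(...).writerows(...) unchanged. If the real time values are not single tokens, the \S+ groups would need a different stopping rule.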

dhanlin · Accepted Answer · 2019-08-20 18:47:37Z

1

This should help solve your problem, even if the pattern repeats any number of times.

import re

str1 = "Item: some text Name: some other text Name:Time recorded: hh:mm Time left: hh1:mm1"

# Capture the text that precedes each label, however many times the labels repeat.
# The value after the final label is not captured by this pattern.
# Side note: ignore the first element of the output list.
print(re.findall(r'(.*?)(?:Item:|Name:|Time left:)', str1))

# This pattern captures only the value after the final label.
print(re.findall(r'.*(?:Item:|Name:|Time left:)(.*)$', str1))

Output:
['', ' some text ', ' some other text ', 'Time recorded: hh:mm ']
[' hh1:mm1']

edited Aug 20, 2019 at 18:47

answered Aug 20, 2019 at 18:39
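
One way to avoid running two separate patterns is to capture each label together with its value and let a lookahead stop at the next label or at the end of the string. This is a rough sketch, assuming a single record on one line with nothing after the last value; find_pairs, WANTED, ANY_LABEL and the sample line are illustrative names, not part of the answer above.

```python
import re

# Labels whose values are wanted, and every label that can end a value.
WANTED = r"Item|Name|Time left"
ANY_LABEL = r"Item|Name|Time recorded|Time left"

# Group 1 is the label, group 2 the value; the lookahead ends the value at
# the next label (or the end of the string) without consuming it.
PAIR = re.compile(rf"({WANTED}):\s*(.*?)\s*(?=(?:{ANY_LABEL}):|$)")


def find_pairs(line):
    """Return (label, value) tuples for the wanted labels in one record line."""
    return PAIR.findall(line)


line = "Item: some text Name: some other text Time recorded: hh:mm Time left: hh1:mm1"
print(find_pairs(line))
```

If unrelated text follows the last value, the "Time left" value would swallow it, so this variant fits best when each record sits on its own line.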


Lipis · Accepted Answer · 2019-08-20 17:49:28Z

1

Use the ': ' (with a space) as a delimiter.


2 Comments

T.Woody Over a year ago

Please read OP's question again, as I firmly believe OP does not understand what delimiter means here.

zmbq Over a year ago

That won't do, you will not be able to tell when the value stops and the new field starts.
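
The objection above can be worked around when the set of labels is known in advance: instead of splitting on ': ' alone, split on the labels themselves, so a colon inside a value such as hh:mm is never treated as a boundary. This is a minimal sketch under that assumption, for one record per line; parse_fields and LABELS are hypothetical names chosen for the example.

```python
import re

# Assumption: these are the only labels that occur in a record.
LABELS = ["Item", "Name", "Time recorded", "Time left"]


def parse_fields(line, labels=LABELS):
    """Split one record line on its known labels and return a label -> value dict."""
    alternatives = "|".join(re.escape(label) for label in labels)
    # The capturing group makes re.split keep the labels in its result.
    splitter = re.compile(rf"\s*\b({alternatives}):\s+")
    parts = splitter.split(line)
    # parts[0] is whatever precedes the first label; after that, labels and
    # values alternate.
    return dict(zip(parts[1::2], parts[2::2]))


record = parse_fields(
    "Item: some text Name: some other text Time recorded: 10:15 Time left: 00:45"
)
print(record["Item"], record["Name"], record["Time left"], sep=" | ")
```

A value that itself contains a label followed by a colon and a space (for example the text "Name: ") would still be split incorrectly; that ambiguity cannot be resolved without changing the file format.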

pjmv · Accepted Answer · 2019-08-20 18:24:33Z

If your data is simple enough and you don't want to use regexes, you can sequentially split your input string on each label, e.g.:

def split_annoying_string(text, labels):
    """Return the values that follow each label, in label order."""
    data = []

    # Drop everything up to and including the first label.
    remainder = text.split(labels[0] + ": ", 1)[1]

    for label in labels[1:]:
        # The text before the next label is the current value;
        # the rest is still to be parsed.
        value, remainder = remainder.split(" " + label + ": ", 1)
        data.append(value)
    data.append(remainder)
    return data


input_string = "Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm"
labels = ["Item", "Name", "Time recorded", "Time left"]

data = split_annoying_string(input_string, labels)
print(data)
#['some text', 'some other text', 'hh:mm', 'hh:mm']

You really should consider getting familiar with regexes though, as ad-hoc hacks such as the one above typically don't adapt very well to changing input formats.
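
For the repeating layout described in the question, the same label-by-label idea can be stretched a little further without regexes. The sketch below is only one possible arrangement: it assumes every record starts with the first label, that labels are preceded by a single space, and that the last value is a single token such as hh:mm. The name split_records and the sample text are invented for the example.

```python
def split_records(text, labels):
    """Return one list of values per record, in label order, without regexes."""
    records = []
    # Everything before the first occurrence of the first label is discarded.
    for chunk in text.split(labels[0] + ": ")[1:]:
        values = []
        rest = chunk
        for label in labels[1:]:
            value, sep, rest = rest.partition(" " + label + ": ")
            if not sep:
                break  # label missing: treat the chunk as malformed and skip it
            values.append(value.strip())
        else:
            # Assumption: the last value is a single token such as "hh:mm";
            # anything after it is unrelated text.
            tokens = rest.split()
            values.append(tokens[0] if tokens else "")
            records.append(values)
    return records


labels = ["Item", "Name", "Time recorded", "Time left"]
text = (
    "Item: apple Name: Alice Time recorded: 10:15 Time left: 00:45 \n"
    " other unrelated text Item: pear Name: Bob Smith "
    "Time recorded: 11:00 Time left: 01:30 \n and so on"
)
for values in split_records(text, labels):
    print(values)
```

Unrelated text that happens to contain "Item: " would start a spurious chunk, which is one more reason the regex-based approaches tend to hold up better as the input changes.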
