2

Wow, thank you for all of the responses on this! To clarify, the data pattern does repeat. Here is a sample:

Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm 
 other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm  other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm 
 and so on and so on

I am using Python 3.7 to parse input from a text file that is formatted like this sample:

Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm
The pattern repeats, with other similar fields, through a few hundred pages.

Because some of the values themselves contain a ":" (e.g. hh:mm), I'm not sure how to use it as a delimiter between the key and the value. I need to obtain all of the values associated with "Item", "Name", and "Time left" and write all of the matching values to a CSV file (I have the output part working).

Any suggestions? Thank you!

(Apologies, I asked this on Stack Exchange and it was deleted; I'm new at this.)

5
  • 3
    it might be possible to use ': ' (with the space) as delimiter Commented Aug 20, 2019 at 17:48
  • 1
    Is that the correct format for the data, exactly? If so, I don't see actual uses of delimiters. This is more extracting text and a regex issue. Delimited data would have a colon after some text, too. Commented Aug 20, 2019 at 17:53
  • 2
    Also, if this code has any sort of value and is production based, you will want to speak with the creator of that file and have it properly formatted. That is the correct response here. Commented Aug 20, 2019 at 18:30
  • 1
    Can you provide a couple of other examples for such data? Also, do you convert each line to a CSV row? Commented Aug 20, 2019 at 18:55
  • Hi, thanks for everyone's replies! Here is some more sample data. The pattern does repeat, and I need to extract every pair in the file that matches the three that I'm looking for: Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm other unrelated text some other unrelated text lots more text that is unrelated Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm and on and on. Commented Aug 20, 2019 at 20:40

4 Answers

2

You can use a regular expression.

import re

rgx = re.compile(r'^Item: (.*) Name: (.*) Time recorded: (.*) Time left: (.*)$')
data = 'Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm'
item, name, time_recorded, time_left = rgx.match(data).groups()
print(item, name, time_recorded, time_left, sep='\n')
# some text
# some other text
# hh:mm
# hh:mm
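
If the records follow one another with unrelated text in between, as the updated sample suggests, a variant of the same idea may be closer to what is needed. The following is only a sketch under stated assumptions: the sample values (apple, Alice, and so on) are made up, the four fields of a record are assumed to sit on one line, and each time is assumed to be a single token without spaces. It uses `re.finditer` with lazy groups and writes the three requested columns with the `csv` module.

```python
import csv
import io
import re

# Hypothetical sample: two records separated by unrelated text.
text = (
    "Item: apple Name: Alice Time recorded: 09:15 Time left: 10:30 "
    "other unrelated text Item: pear Name: Bob "
    "Time recorded: 11:00 Time left: 12:45 and so on"
)

# Lazy groups stop at the next label; \S+ assumes a time has no spaces.
record_rgx = re.compile(
    r"Item: (?P<item>.*?) Name: (?P<name>.*?) "
    r"Time recorded: (?P<recorded>\S+) Time left: (?P<left>\S+)"
)

# Keep only the three fields the question asks for.
records = [
    (m.group("item"), m.group("name"), m.group("left"))
    for m in record_rgx.finditer(text)
]

# Write to an in-memory buffer here; a real script would open a file
# with newline="" instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Item", "Name", "Time left"])
writer.writerows(records)
print(buffer.getvalue())
```

If a value can itself contain the text of a later label, or if a record wraps across lines, the pattern would need adjusting.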

2 Comments

This will fail if the pattern repeats, and the question clearly says the pattern repeats.
@dhanlin Yes, I'm not sure if OP means there is one per line, in which case this could work, or one after another.
1

This should help solve your problem, even if the pattern repeats any number of times.

import re
str1 = "Item: some text Name: some other text Name:Time recorded: hh:mm Time left: hh1:mm1"

# This regex captures the text preceding each label, however many times the labels repeat.
# The value after the final label is not captured here.
# Side note: ignore the first element of the output list (the text before the first label).
print(re.findall(r'(.*?)(?:Item:|Name:|Time left:)', str1))

# This regex captures only the value after the final label.
print(re.findall(r'.*(?:Item:|Name:|Time left:)(.*)$', str1))

Output:
['', ' some text ', ' some other text ', 'Time recorded: hh:mm ']
[' hh1:mm1']


1

Use the ': ' (with a space) as a delimiter.
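
A quick check on the sample line from the question, offered as a sketch rather than a full solution, shows what this split returns. The times survive intact because the ":" inside hh:mm is not followed by a space, but each value comes back fused with the label that follows it, so a second step that knows the label names is still needed.

```python
# Sample line taken from the question.
line = "Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm"

# Split on colon-plus-space only; the colon inside "hh:mm" is left alone.
parts = line.split(": ")
print(parts)
# ['Item', 'some text Name', 'some other text Time recorded', 'hh:mm Time left', 'hh:mm']
```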

2 Comments

Please read OP's question again, as I firmly believe OP does not understand what delimiter means here.
That won't do: you will not be able to tell where the value stops and the new field starts.
1

If your data is simple enough and you don't want to use regexes, you can sequentially split your input string on each label, e.g.:

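def split_annoying_string(text, labels):
    """Return the value that follows each label, in label order."""
    data = []

    # Drop everything up to and including the first label.
    temp_string = text.split(labels[0] + ": ", 1)[1]

    for label in labels[1:]:
        # Text before the next label is the current value; the rest is still to be parsed.
        temp_data, temp_string = temp_string.split(" " + label + ": ", 1)
        data.append(temp_data)
    # Whatever remains after the last label is the final value.
    data.append(temp_string)
    return data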


input_string = "Item: some text Name: some other text Time recorded: hh:mm Time left: hh:mm"
labels = ["Item", "Name", "Time recorded", "Time left"]

data = split_annoying_string(input_string, labels)
print(data)
#['some text', 'some other text', 'hh:mm', 'hh:mm']

You really should consider getting familiar with regexes though, as ad-hoc hacks such as the one above typically don't adapt very well to changing input formats.
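
As one possible illustration of that point, the same list of labels can drive a regex instead of a chain of splits, so a changed format means editing only the list. This is a sketch under assumptions: `extract_records` is a hypothetical helper, the sample values are made up, and the last field is assumed to be a single token with no spaces (true for an hh:mm time).

```python
import re


def extract_records(text, labels):
    """Hypothetical helper: return one dict per record found in text."""
    # Every label except the last captures lazily up to the next label.
    parts = [re.escape(label) + r": (.*?)" for label in labels[:-1]]
    # Assumption: the final value is one whitespace-free token.
    parts.append(re.escape(labels[-1]) + r": (\S+)")
    pattern = re.compile(" ".join(parts))
    return [dict(zip(labels, m.groups())) for m in pattern.finditer(text)]


labels = ["Item", "Name", "Time recorded", "Time left"]
# Made-up sample with unrelated text between two records.
sample = (
    "Item: a Name: b Time recorded: 01:00 Time left: 02:00 junk "
    "Item: c Name: d Time recorded: 03:00 Time left: 04:00"
)

records = extract_records(sample, labels)
for record in records:
    print(record["Item"], record["Name"], record["Time left"])
```

Returning dicts keyed by label makes it easy to pick out only the columns that are needed when writing the CSV.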

