How to extract contents between two strings in Python?

Question

I am fairly new to Python.

I have a .txt file with roughly 500k lines of text. The general structure is like this:

WARC-TREC-ID:

hello

my

name

is

WARC-TREC-ID:

example

text

WARC-TREC-ID:

I would like to extract all contents in between the "WARC-TREC-ID:" keywords.

This is what I have already tried:

content_list = []

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors = 'ignore') as openfile2:
    for line in openfile2:
        for item in line.split("WARC-TREC-ID:"):
            if "WARC-TREC-ID:" in item:
                content = (item [ item.find("WARC-TREC-ID:")+len("WARC-TREC-ID:") : ])
                content_list.append(content)

This returns an empty list.

I have also tried:

import re

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r') as openfile3:
    
    m = re.search('WARC-TREC-ID:(.+?)WARC-TREC-ID:', openfile3)
    if m: 
        found = m.group(1)

This raises a TypeError: expected string or bytes-like object

Rakesh · Accepted Answer · 2020-02-06 07:53:53Z

2

Try:

content_list = []
with open(filename) as infile:
    for line in infile:                    # iterate over each line
        if 'WARC-TREC-ID:' in line:        # line contains the marker
            content_list.append([])        # start a new, empty group
        else:
            content_list[-1].append(line)  # add the line to the current group

print(content_list)
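
The same idea can be written a little more defensively. The sketch below is only one possible variant, not part of the original answer: it skips any lines that appear before the first marker, drops blank lines, and keeps a group only once a closing marker has been seen. The helper name group_lines_by_marker and the sample lines are made up for illustration.

```python
import io

MARKER = "WARC-TREC-ID:"


def group_lines_by_marker(lines, marker=MARKER):
    """Collect the non-blank lines found between consecutive marker lines."""
    groups = []
    current = None  # stays None until the first marker has been seen
    for line in lines:
        if marker in line:
            if current is not None:
                groups.append(current)  # a closing marker ends the group
            current = []                # and opens the next one
        elif current is not None:
            text = line.strip()
            if text:                    # skip blank lines
                current.append(text)
    # Lines after the last marker are not "between" two markers, so they are dropped.
    return groups


# io.StringIO stands in for an open file here (hypothetical sample data).
sample = io.StringIO(
    "preamble\nWARC-TREC-ID:\n\nhello\n\nmy\n\nname\n\nis\n\n"
    "WARC-TREC-ID:\n\nexample\n\ntext\n\nWARC-TREC-ID:\n"
)
groups = group_lines_by_marker(sample)
print(groups)  # [['hello', 'my', 'name', 'is'], ['example', 'text']]
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, so a large file is still read one line at a time.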


4 Comments

Zeshan Fayyaz Over a year ago

I am getting a list index-out-of-range error on the following line: "content_list[-1].append(line) "

Rakesh Over a year ago

Try declaring content_list = [[]]

Zeshan Fayyaz Over a year ago

I changed the declaration of content_list and removed any text before the first 'WARC-TREC-ID:'. Now I am getting the following error: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5720: character maps to <undefined>"

Zeshan Fayyaz Over a year ago

I added an errors = 'ignore' parameter in my open file line and it worked. Thank you
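
The 'charmap' message suggests that open() fell back to a Windows code page as its default encoding. With errors='ignore' the undecodable bytes are silently dropped. A possible alternative, sketched below, is to name the encoding explicitly and use errors='replace' so that bad bytes stay visible as U+FFFD. The choice of UTF-8 is an assumption, since the thread does not say how the file is actually encoded, and read_text_lenient is a hypothetical helper name.

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8"):
    """Read a text file, replacing undecodable bytes with U+FFFD."""
    # encoding="utf-8" is an assumption; use the file's real encoding if known.
    with open(path, encoding=encoding, errors="replace") as f:
        return f.read()


# Demo with a temporary file that contains one byte (0x8f) that is invalid UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"WARC-TREC-ID:\nhello\x8f\n")
    path = tmp.name

try:
    text = read_text_lenient(path)
finally:
    os.remove(path)

print(repr(text))
```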

Abdul Mateen · Accepted Answer · 2020-02-06 07:53:42Z

0

In your second approach, re.search expects a string (or bytes-like object), not a file object, so pass it the file's contents, for example openfile3.read(). Even then, re.search returns only the first match. You might want to use re.findall to get all of them.
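
As a rough sketch of the findall approach (not taken from the original answer), two details matter. By default `.` does not match newlines, so re.DOTALL is needed for blocks that span several lines. The closing marker is also matched with a lookahead so that it is not consumed and can open the next block. The function name blocks_between_markers and the sample string are made up for illustration.

```python
import re

MARKER = "WARC-TREC-ID:"


def blocks_between_markers(text, marker=MARKER):
    """Return the text found between each pair of consecutive markers."""
    escaped = re.escape(marker)
    # (?=...) is a lookahead: the closing marker must follow, but it is not
    # consumed, so it can also serve as the opening marker of the next block.
    pattern = escaped + r"(.*?)(?=" + escaped + r")"
    return [block.strip() for block in re.findall(pattern, text, flags=re.DOTALL)]


sample = (
    "WARC-TREC-ID:\nhello\nmy\nname\nis\n"
    "WARC-TREC-ID:\nexample\ntext\n"
    "WARC-TREC-ID:\n"
)
blocks = blocks_between_markers(sample)
print(blocks)  # ['hello\nmy\nname\nis', 'example\ntext']
```

This works on the whole file content as one string, so it assumes the file fits comfortably in memory.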


Zaraki Kenpachi · Accepted Answer · 2020-02-06 08:23:21Z

-1

For a file that contains your data:

with open('data.txt', 'r') as f:
    raw_data = f.read()
# split() breaks on any whitespace, so this yields individual words
result = [x for x in raw_data.split() if x != 'WARC-TREC-ID:']

Output:

['hello', 'my', 'name', 'is', 'example', 'text']
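
Note that the output above is one flat list, so the boundaries between blocks are lost. If the grouping matters, one option is to split on the marker itself instead of on whitespace. The sketch below is an illustration under that assumption; split_on_marker and the sample string are hypothetical names, not part of the original answer.

```python
MARKER = "WARC-TREC-ID:"


def split_on_marker(text, marker=MARKER):
    """Return one list of non-blank lines per block between two markers."""
    # The first piece precedes the first marker and the last piece follows the
    # last marker, so neither lies between two markers.
    pieces = text.split(marker)[1:-1]
    return [
        [line.strip() for line in piece.splitlines() if line.strip()]
        for piece in pieces
    ]


sample = (
    "WARC-TREC-ID:\n\nhello\n\nmy\n\nname\n\nis\n\n"
    "WARC-TREC-ID:\n\nexample\n\ntext\n\n"
    "WARC-TREC-ID:\n"
)
grouped = split_on_marker(sample)
print(grouped)  # [['hello', 'my', 'name', 'is'], ['example', 'text']]
```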
