I am fairly new to Python.
I have a .txt file with roughly 500k lines of text. The general structure is like this:
WARC-TREC-ID:
hello
my
name
is
WARC-TREC-ID:
example
text
WARC-TREC-ID:
I would like to extract all of the text between consecutive "WARC-TREC-ID:" markers.
This is what I have already tried:
content_list = []
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors='ignore') as openfile2:
    for line in openfile2:
        for item in line.split("WARC-TREC-ID:"):
            if "WARC-TREC-ID:" in item:
                content = item[item.find("WARC-TREC-ID:") + len("WARC-TREC-ID:"):]
                content_list.append(content)
This leaves content_list empty.
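A likely reason for the empty result, shown as a small sketch: `str.split` removes the separator from every piece it returns, so the later `"WARC-TREC-ID:" in item` check can never be true. The two sample lines below are stand-ins for lines of the file, not its real contents.

```python
marker = "WARC-TREC-ID:"

# Stand-ins for two lines as they would come out of the file iterator.
pieces_marker_line = "WARC-TREC-ID:\n".split(marker)
pieces_text_line = "hello\n".split(marker)

print(pieces_marker_line)  # ['', '\n']
print(pieces_text_line)    # ['hello\n']

# The separator never survives the split, so the membership test always fails.
marker_survives = any(marker in piece for piece in pieces_marker_line + pieces_text_line)
print(marker_survives)     # False
```

The loop also works on one line at a time, so even without the split it could not see text that spans several lines between two markers.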
I have also tried:
import re
with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r') as openfile3:
    m = re.search('WARC-TREC-ID:(.+?)WARC-TREC-ID:', openfile3)
    if m:
        found = m.group(1)
This raises TypeError: expected string or bytes-like object.
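The TypeError appears because `re.search` is given the file object rather than its text. A minimal sketch of a regex-based variant follows, assuming the whole file fits in memory and the layout matches the sample above. The names `extract_between`, `PATTERN` and `sample` are hypothetical, introduced only for illustration. Two further changes are needed beyond passing a string: `re.DOTALL` lets `.` match newlines, and a lookahead keeps each closing marker available as the opening marker of the next block.

```python
import re

MARKER = "WARC-TREC-ID:"

# (.*?) lazily captures the text after a marker; the lookahead (?=...) checks
# for the next marker without consuming it, so that marker can open the next block.
PATTERN = re.compile(
    re.escape(MARKER) + r"(.*?)(?=" + re.escape(MARKER) + r")",
    re.DOTALL,  # let "." match newline characters as well
)

def extract_between(text):
    """Return the stripped, non-empty text found between consecutive markers."""
    return [block.strip() for block in PATTERN.findall(text) if block.strip()]

# In-memory stand-in for the file contents, mirroring the structure shown above.
sample = (
    "WARC-TREC-ID:\nhello\nmy\nname\nis\n"
    "WARC-TREC-ID:\nexample\ntext\n"
    "WARC-TREC-ID:\n"
)
print(extract_between(sample))
# ['hello\nmy\nname\nis', 'example\ntext']
```

With a real file, the text would come from `openfile3.read()` inside the `with` block. Anything after the final marker is not returned, because it is not enclosed by two markers.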