I am fairly new to Python.

I have a .txt file with roughly 500k lines of text. The general structure is like this:

WARC-TREC-ID:

hello

my

name

is

WARC-TREC-ID:

example

text

WARC-TREC-ID:

I would like to extract all contents in between the "WARC-TREC-ID:" keywords.

This is what I have already tried:

content_list = []

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', errors='ignore') as openfile2:
    for line in openfile2:
        for item in line.split("WARC-TREC-ID:"):
            if "WARC-TREC-ID:" in item:
                content = item[item.find("WARC-TREC-ID:") + len("WARC-TREC-ID:"):]
                content_list.append(content)

This returns an empty list.

I have also tried:

import re

with open('C://Users//HOME//Desktop//Document_S//corpus_test//00.txt', 'r') as openfile3:
    m = re.search('WARC-TREC-ID:(.+?)WARC-TREC-ID:', openfile3)
    if m:
        found = m.group(1)

This raises a TypeError: expected string or bytes-like object

3 Answers

2

Try:

content_list = []
with open(filename) as infile:
    for line in infile:                     # iterate over each line
        if 'WARC-TREC-ID:' in line:         # marker line: start a new block
            content_list.append([])
        else:
            content_list[-1].append(line)   # add the line to the current block

print(content_list)

4 Comments

I am getting a "list index out of range" error on the following line: content_list[-1].append(line)
Try declaring content_list = [[]]
I changed the declaration of content_list and removed any text before the first 'WARC-TREC-ID:'. Now I am getting the following error: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 5720: character maps to <undefined>"
I added an errors='ignore' parameter to my open() call and it worked. Thank you.
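Pulling the fixes from these comments together, a minimal sketch might look like the following. The function name read_blocks, the utf-8 default, and the temporary demo file are assumptions made for illustration, not part of the original answer. Unlike the original snippet, this version skips any text before the first marker instead of raising an IndexError, drops blank lines, and keeps only blocks that are closed by a following marker. Note that errors='ignore' silently discards bytes that cannot be decoded.

```python
import os
import tempfile


def read_blocks(path, marker="WARC-TREC-ID:", encoding="utf-8"):
    """Return one list of non-empty lines per block enclosed by two markers."""
    blocks = []
    current = None  # stays None until the first marker line is seen
    with open(path, encoding=encoding, errors="ignore") as infile:
        for line in infile:
            if marker in line:
                if current is not None:
                    blocks.append(current)  # a marker closes the open block
                current = []                # and opens the next one
            elif current is not None:
                text = line.strip()
                if text:                    # skip blank lines
                    current.append(text)
    return blocks


# Demo on a small temporary file shaped like the sample in the question,
# with one extra line before the first marker.
sample = (
    "preamble\n"
    "WARC-TREC-ID:\n\nhello\n\nmy\n\nname\n\nis\n\n"
    "WARC-TREC-ID:\n\nexample\n\ntext\n\n"
    "WARC-TREC-ID:\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

blocks = read_blocks(tmp.name)
os.remove(tmp.name)
print(blocks)  # [['hello', 'my', 'name', 'is'], ['example', 'text']]
```

Because the file is read line by line, this approach should stay memory-friendly even for a file with hundreds of thousands of lines.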
0

In your second approach, re.search expects a string (or bytes) as its second argument, not a file object, so pass the file's contents instead, for example openfile3.read(). Even then, re.search only returns the first occurrence of the pattern. You might want to use re.findall to get all of them.
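As a rough sketch of that suggestion, the snippet below runs re.findall on a string shaped like the sample in the question. Two details go beyond what the answer states and are worth checking against real data: re.DOTALL is needed so that "." also matches newlines, and the closing marker is matched with a lookahead so it can also serve as the opening marker of the next block. The helper name blocks_between_markers is hypothetical.

```python
import re

MARKER = "WARC-TREC-ID:"


def blocks_between_markers(text, marker=MARKER):
    """Return the stripped text found between each pair of consecutive markers."""
    # (?=...) is a lookahead: the closing marker is checked but not consumed,
    # so it is still available as the opening marker of the following block.
    pattern = re.escape(marker) + r"(.*?)(?=" + re.escape(marker) + r")"
    return [block.strip() for block in re.findall(pattern, text, flags=re.DOTALL)]


sample = (
    "WARC-TREC-ID:\n\nhello\n\nmy\n\nname\n\nis\n\n"
    "WARC-TREC-ID:\n\nexample\n\ntext\n\n"
    "WARC-TREC-ID:\n"
)
result = blocks_between_markers(sample)
print(result)  # ['hello\n\nmy\n\nname\n\nis', 'example\n\ntext']
```

With a real file, the text argument would be the result of reading the whole file into memory. If the marker lines carry an ID after the colon, that ID ends up at the start of each captured block.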

-1

For a file that contains your data:

with open('data.txt', 'r') as infile:
    raw_data = infile.read()
result = [x for x in raw_data.split() if x != 'WARC-TREC-ID:']

Output:

['hello', 'my', 'name', 'is', 'example', 'text']
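The list above is flat, so it no longer shows which words sat between which pair of markers. If that grouping matters, one possible variant is to split on the marker first and then split each chunk into words. This is only a sketch: the helper name words_per_block is hypothetical, and it assumes the marker never appears inside the content itself.

```python
MARKER = "WARC-TREC-ID:"


def words_per_block(raw_data, marker=MARKER):
    """Return one list of words for each chunk enclosed by two markers."""
    chunks = raw_data.split(marker)
    # chunks[0] is the text before the first marker and chunks[-1] is the
    # text after the last one; neither is enclosed by two markers.
    return [chunk.split() for chunk in chunks[1:-1]]


sample = (
    "WARC-TREC-ID:\n\nhello\n\nmy\n\nname\n\nis\n\n"
    "WARC-TREC-ID:\n\nexample\n\ntext\n\n"
    "WARC-TREC-ID:\n"
)
grouped = words_per_block(sample)
print(grouped)  # [['hello', 'my', 'name', 'is'], ['example', 'text']]
```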
