
I'm trying to parse the text in the ebooks at gutenberg.org to extract info about the books, for example, the title.

Every book on there has a line like this:

*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES *** 

I'd like to use something like this:

book_name=()
index = 0
for line in finalLines:
    index+=1
    if  "*** START OF THIS PROJECT GUTENBERG EBOOK "%%%"***" in line:
        print(index, line)
        book_name=%%%

but I'm obviously not doing it right. Can someone show me how it's done?

3 Answers

Regex is the way to go:

import re

title_regex = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

for index, line in enumerate(finalLines):
    match = title_regex.match(line)

    if match:
        book_name = match.group(1)
        print(index, book_name)

You can also parse it line-by-line:

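import urllib.request

url = 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    # urlopen() yields bytes, so decode each line to str (assuming UTF-8)
    lines = [raw.decode('utf-8') for raw in book.readlines()]

reached_start = False
metadata = {}

for index, line in enumerate(lines):
    # The first '***' line ends the header; the second one ends the scan
    if line.startswith('***'):
        if not reached_start:
            reached_start = True
        else:
            break

    # Header lines look like 'Key: value'
    if not reached_start and ':' in line:
        key, _, value = line.partition(':')
        metadata[key.strip().lower()] = value.strip()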
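To see what the line-by-line approach collects without downloading anything, here is a minimal sketch run on a few made-up header lines. The `parse_header` helper and the sample lines are hypothetical, introduced only for illustration, and the lines are assumed to be `str` already.

```python
def parse_header(lines):
    """Collect 'Key: value' pairs that appear before the first '***' line."""
    metadata = {}
    for line in lines:
        if line.startswith('***'):
            break  # the header ends at the START marker
        key, sep, value = line.partition(':')
        if sep:  # only lines that actually contain a colon
            metadata[key.strip().lower()] = value.strip()
    return metadata


# Hypothetical sample lines, not copied from a real file
sample = [
    "Title: Pride and Prejudice\n",
    "\n",
    "Author: Jane Austen\n",
    "*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n",
    "Chapter 1: not metadata\n",
]

header = parse_header(sample)
print(header.get('title'))
```

Stripping the key and value keeps the trailing newline and the space after the colon out of the dictionary.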

Comments

:( TypeError: can't use a string pattern on a bytes-like object
@user2344772: Change r'\*{3} to rb'\*{3} (a bytes pattern) and give it a go.
@user2344772 Where do the lines come from? If you have bytes there, it is a different issue.
And the lines come from this:
The b prefix just indicates that the value is a bytes object and not a regular (str) string.
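For anyone hitting the same TypeError, here is a minimal sketch of the bytes-versus-str issue discussed in these comments. The `raw_lines` sample and the `find_title` helper are hypothetical stand-ins for what `urlopen(...).readlines()` returns, and UTF-8 is an assumed encoding.

```python
import re

# Hypothetical sample standing in for the bytes lines read from the URL
raw_lines = [
    b"Title: The Adventures of Sherlock Holmes\r\n",
    b"*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***\r\n",
]

# A str pattern: it can only be applied to str, not to bytes
title_regex = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')


def find_title(lines, encoding='utf-8'):
    """Return the first title found in an iterable of bytes lines, or None."""
    for raw in lines:
        line = raw.decode(encoding)  # bytes -> str so the str pattern applies
        match = title_regex.match(line)
        if match:
            return match.group(1)
    return None


book_name = find_title(raw_lines)
print(book_name)
```

Calling `title_regex.match(raw_lines[1])` directly, without decoding, is what raises the TypeError. The alternative is to keep the lines as bytes and compile a bytes pattern instead.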

The simplest solution:

sp = line.split()
if sp[:7]+sp[-1:] == '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split():
    bookname = ' '.join(sp[7:-1])

A better solution would use a regular expression, as suggested.

If you are working with bytes, either compare against a bytes pattern such as b'*** START OF THIS PROJECT GUTENBERG EBOOK ***', or convert each byte string to str with bytes.decode(s).

Your snippet (with the urlopen() part) might look like this:

import urllib.request
url = 'http://gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    finalLines = book.readlines()

booktitle_pattern = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split()
bookname = None
for index, line in enumerate(finalLines):
    sp = [bytes.decode(word) for word in line.split()]
    if sp[:7]+sp[-1:] == booktitle_pattern :
        bookname = ' '.join(sp[7:-1])

Comments

That code didn't really do it, I'm afraid. What do the numbers 7 and -1 represent?
Oh, never mind, I get it: the line is split into words. I still can't make it work, though.
I'm confused: are you using booktitle and bookname for different things? Sorry, I'm very new and this is all quite confusing to me.
@Deivore Two things: (a) I tried to give it a better name. It's irrelevant that you are new; names should be clear. (b) I added a bit about using bytes instead of plain strings, since that seems to be what is happening in your case.
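To make the 7 and -1 concrete, here is a small sketch of the slicing on the sample line from the question. The variable names `prefix`, `suffix`, `title_words`, and `marker` are introduced here only for illustration.

```python
line = "*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***"
sp = line.split()

# The fixed part of the marker is 7 words long, followed by the title,
# followed by one closing '***' word.
prefix = sp[:7]         # the first seven words: '***' ... 'EBOOK'
suffix = sp[-1:]        # a one-element list holding the last word
title_words = sp[7:-1]  # everything between the prefix and the last word

marker = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split()

bookname = None
if prefix + suffix == marker:
    bookname = ' '.join(title_words)

print(bookname)
```

Using `sp[-1:]` rather than `sp[-1]` keeps the result a list, so it can be concatenated with `prefix` and compared against `marker` in one step.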
import re
import urllib.request

url = 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    lines = book.readlines()  # each line is a bytes object

# Bytes pattern (note the rb prefix), because the lines are bytes
title_regex = re.compile(rb'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

for index, line in enumerate(lines):
    match = title_regex.match(line)

    if match:
        # Decode the captured bytes to str (assuming UTF-8)
        book_name = match.group(1).decode('utf-8')
        print(book_name)
