Python: How to use %%% when parsing text

Question

I'm trying to parse the text in the ebooks at gutenberg.org to extract info about the books, for example, the title.

Every book on there has a line like this:

*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***

I'd like to use something like this:

book_name=()
index = 0
for line in finalLines:
    index+=1
    if  "*** START OF THIS PROJECT GUTENBERG EBOOK "%%%"***" in line:
        print(index, line)
        book_name=%%%

but I'm obviously not doing it right. Can someone show me how it's done??

It sounds like you want to use a regex, something like \*\*\* START OF THIS PROJECT GUTENBERG EBOOK (.*) \*\*\*. Learn more: docs.python.org/library/re.html, regular-expressions.info/reference.html, regexpal.com
– Patashu, Commented May 12, 2013 at 2:02

Blender · Accepted Answer · 2013-05-12 03:04:08Z

3

Regex is the way to go:

import re

title_regex = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

for index, line in enumerate(finalLines):
    match = title_regex.match(line)

    if match:
        book_name = match.group(1)
        print(index, book_name)
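
As a quick offline check, the pattern can be tried against the sample line from the question. This is only a sketch: `extract_title` is a hypothetical helper name that does not appear in the answer, and it assumes each line is already a `str` rather than `bytes`.

```python
import re

# Same pattern as above: a literal "***", the fixed prefix, then a lazy capture group.
TITLE_RE = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')


def extract_title(line):
    """Return the captured title, or None if the line is not a start marker.

    Hypothetical helper for illustration; assumes `line` is a str.
    """
    match = TITLE_RE.match(line)
    return match.group(1) if match else None


sample = '*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***'
print(extract_title(sample))       # THE ADVENTURES OF SHERLOCK HOLMES
print(extract_title('Chapter I'))  # None
```

Because `match()` only matches at the beginning of the string, a line with leading whitespace would need `search()` or a `strip()` first.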

You can also parse it line-by-line:

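import urllib.request

url = 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    # urlopen() returns bytes; decode each line so the str tests below work.
    # UTF-8 is assumed here.
    lines = [line.decode('utf-8') for line in book.readlines()]

reached_start = False
metadata = {}

for index, line in enumerate(lines):
    # The first '***' line marks the start of the book; a second one ends the loop.
    if line.startswith('***'):
        if not reached_start:
            reached_start = True
        else:
            break

    # Header lines before the start marker look like 'Key: value'.
    if not reached_start and ':' in line:
        key, _, value = line.partition(':')
        metadata[key.strip().lower()] = value.strip()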
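
To see what such a loop collects, here is a small sketch run on a hand-written header instead of a downloaded file. The sample lines and the `parse_header` name are assumptions made for illustration, and real Project Gutenberg headers may look different. It is also simplified to stop at the first `***` line.

```python
def parse_header(lines):
    """Collect 'Key: value' pairs that appear before the first '***' line."""
    metadata = {}
    for line in lines:
        if line.startswith('***'):
            break  # the start marker ends the header
        if ':' in line:
            key, _, value = line.partition(':')
            metadata[key.strip().lower()] = value.strip()
    return metadata


# Made-up header lines, not copied from a real file.
sample_header = [
    'Title: Pride and Prejudice\n',
    'Author: Jane Austen\n',
    '\n',
    '*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n',
    'Chapter 1: not metadata\n',
]
print(parse_header(sample_header))
# {'title': 'Pride and Prejudice', 'author': 'Jane Austen'}
```

`partition(':')` splits on the first colon only, so a value that itself contains a colon stays intact.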

edited May 12, 2013 at 3:04

answered May 12, 2013 at 2:22

Blender


Deivore Over a year ago

:( TypeError: can't use a string pattern on a bytes-like object

Blender Over a year ago

@user2344772: Change r'\*{3} to b'\*{3} and give it a go.

Elazar Over a year ago

@user2344772 where do the lines come from? If you have bytes there, it is a different issue.

Deivore Over a year ago

and the lines come from this:

Elazar Over a year ago

the b is just a representation of the fact that it is a byte string and not a simple string.


Elazar · Accepted Answer · 2013-05-12 03:10:22Z

2

The simplest solution:

sp = line.split()
if sp[:7]+sp[-1:] == '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split():
    bookname = ' '.join(sp[7:-1])
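
To make the slice indices concrete: the fixed prefix `*** START OF THIS PROJECT GUTENBERG EBOOK` is seven words long, so `sp[:7]` is the prefix, `sp[-1:]` is the closing `***`, and `sp[7:-1]` is whatever sits between them. A small sketch using the sample line from the question (the `marker` name is introduced here only for illustration):

```python
line = '*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***'
sp = line.split()

# Seven prefix words plus the closing '***'.
marker = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split()

print(sp[:7])    # the first seven words: the fixed prefix
print(sp[-1:])   # a one-element list holding the last word
print(sp[7:-1])  # everything in between: the words of the title

bookname = None
if sp[:7] + sp[-1:] == marker:
    bookname = ' '.join(sp[7:-1])
print(bookname)  # THE ADVENTURES OF SHERLOCK HOLMES
```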

A better solution would use a regular expression, as suggested.

If you are working with bytes, you should use b'*** START OF THIS PROJECT GUTENBERG EBOOK ***', or use bytes.decode(s) for any byte string.

Your snippet (with the urlopen() part) might look like this:

import urllib.request
url = 'http://gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    finalLines = book.readlines()

booktitle_pattern = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split()
bookname = None
for index, line in enumerate(finalLines):
    sp = [bytes.decode(word) for word in line.split()]
    if sp[:7] + sp[-1:] == booktitle_pattern:
        bookname = ' '.join(sp[7:-1])
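
The bytes-versus-str issue mentioned above can be reproduced without any network access. This sketch uses a hard-coded bytes line standing in for what `readlines()` returns on a `urlopen()` response, and it assumes the text is UTF-8 encoded. The encoding and the variable names are assumptions for illustration.

```python
import re

# Stand-in for one line returned by readlines() on a urlopen() response.
raw_line = b'*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n'

str_pattern = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')
bytes_pattern = re.compile(rb'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

# A str pattern applied to a bytes object raises TypeError.
try:
    str_pattern.match(raw_line)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Option 1: decode the line first, then use the str pattern.
title_from_str = str_pattern.match(raw_line.decode('utf-8')).group(1)

# Option 2: keep the bytes, use a bytes pattern, then decode the captured group.
title_from_bytes = bytes_pattern.match(raw_line).group(1).decode('utf-8')

print(mixed_ok, title_from_str, title_from_bytes)
```

Decoding once, right after reading, tends to be the simpler choice, because every later string operation can then use ordinary `str` values.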

edited May 12, 2013 at 3:10

answered May 12, 2013 at 2:03

Elazar


Deivore Over a year ago

That code didn't really do it, I'm afraid. What do the numbers 7 and -1 represent?

Deivore Over a year ago

Oh nvm, I get it, it's split into words. Still can't make it work, though.

Deivore Over a year ago

I'm confused, are you using booktitle and bookname for different things? Sorry, I'm very new and this is all quite confusing to me.

Elazar Over a year ago

@Deivore two things: (a) I tried to give it a better name. It's irrelevant that you are new; names should be clear. (b) I added a bit about using bytes instead of simple strings, since that seems to be what is happening in your case.

Deivore · Accepted Answer · 2013-05-12 02:47:55Z

0

import re
import urllib.request

url = 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt'
book = urllib.request.urlopen(url)
lines = book.readlines()  # each line is a bytes object
book.close()

# Bytes pattern (note the rb prefix) so it can match the bytes lines directly.
title_regex = re.compile(rb'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

for index, line in enumerate(lines):
    match = title_regex.match(line)

    if match:
        # group(1) is bytes here; call .decode() on it to get a str.
        book_name = match.group(1)
        print(book_name)

answered May 12, 2013 at 2:47

Deivore
