Is there a way to get a string out of HTML?

Question

I'd like a way to generate a filename for osu! beatmaps which I'm downloading. Ideally I would go through the HTML looking for a certain phrase, beatmapsets?q=, and get the word(s) that come after the q=.

I've tried using lxml.html, but I have little experience with it, and the code below returns an empty list.

import requests
from lxml.html import fromstring


class OsuMaps:
    def generateFileName(self, num1=None):
        if not num1:
            print("Missing required argument: 'num1'")
            return
        dl = requests.get(f"https://bloodcat.com/osu/s/{num1.rstrip()}")

        # ..generate FinalName

        tree = fromstring(dl.content)
        # XPath contains() takes two arguments: the attribute and the substring
        FinalName = tree.xpath(
            "//a[contains(@href, 'beatmapsets?q=')]"
        )

        return FinalName


osu = OsuMaps()
osu.generateFileName("653534")  # ideal outcome - "653534 Panda Eyes - ILY"

The ideal result is shown in the comment, but I don't know where to start. All I know is that the two keywords I need (the song name, ILY, and the artist, Panda Eyes) are in the HTML as:

<a class="beatmapset-header__details-text beatmapset-header__details-text--title u-ellipsis-overflow" href="/beatmapsets?q=ILY">ILY</a>

and

<a class="beatmapset-header__details-text beatmapset-header__details-text--artist" href="/beatmapsets?q=Panda%20Eyes">Panda Eyes</a>

I would also need to be able to re-use this code so that it gets the <text> part of q=<text> every time.
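For a page that really is HTML and contains anchors like the two shown above, the q= values could be pulled out along these lines. This is only a sketch: it swaps lxml for the standard library's html.parser and urllib.parse, the names QueryLinkParser and extract_q_values are made up for illustration, and the sample markup is a simplified version of the anchors quoted above.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class QueryLinkParser(HTMLParser):
    """Collect the q= value of every <a> whose href contains 'beatmapsets?q='."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "beatmapsets?q=" not in href:
            return
        # parse_qs also decodes percent-escapes such as %20
        query = parse_qs(urlsplit(href).query)
        self.values.extend(query.get("q", []))


def extract_q_values(html):
    """Return the decoded q= values in document order."""
    parser = QueryLinkParser()
    parser.feed(html)
    return parser.values


# Simplified sample markup (class attributes omitted)
sample = (
    '<a href="/beatmapsets?q=ILY">ILY</a>'
    '<a href="/beatmapsets?q=Panda%20Eyes">Panda Eyes</a>'
)
print(extract_q_values(sample))
```

Because the function takes the HTML as a string, it can be reused on any page text, for example the text attribute of a requests response.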

The URL https://bloodcat.com/osu/s/653534 prompts a download of the file 653534 Panda Eyes - ILY.osz; it is not HTML content (len(dl.text) --> 9123471). The question is not relevant.
– RomanPerekhrest, Commented Jul 20, 2019 at 19:48
@RomanPerekhrest If it automatically downloads the file with the desired name, is there a way for me to keep the original filename and avoid the whole generateFileName() function?
– xupaii, Commented Jul 20, 2019 at 21:57

May.D · Accepted Answer · 2019-07-20 23:04:38Z

According to the requests documentation, the content attribute of the response returned by requests.get() holds the raw bytes of the response body, while text holds the decoded string. What you need to parse is dl.text.

Also, as @RomanPerekhrest points out, the given link refers to a binary file, so parsing it with lxml won't make sense. However, you can use the requests.head() method to fetch only the response headers and extract the file name from them.

Try something like below:

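# HEAD fetches only the response headers, not the multi-megabyte file body
dl = requests.head(f"https://bloodcat.com/osu/s/{num1.rstrip()}")
# Take the text between 'filename="' and '";', then turn %20 back into spaces
fname = dl.headers["Content-Disposition"].split('filename="')[-1].split('";')[0].replace("%20", " ")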

# fname == '653534 Panda Eyes - ILY.osz'
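A slightly more defensive variant of the one-liner above might look like the following sketch. It returns None when the Content-Disposition header is missing and uses urllib.parse.unquote to decode every percent-escape rather than only %20. The function name filename_from_headers is hypothetical, and the header is assumed to carry the name in a quoted filename="..." parameter, as in the example above.

```python
import re
from typing import Mapping, Optional
from urllib.parse import unquote


def filename_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the decoded file name from Content-Disposition, or None."""
    value = headers.get("Content-Disposition")
    if value is None:
        return None  # header absent, e.g. the request failed or was redirected
    # Assumes a quoted filename="..." parameter
    match = re.search(r'filename="([^"]*)"', value)
    if match is None:
        return None
    # unquote decodes all percent-escapes, not just %20
    return unquote(match.group(1))


# Header value shaped like the one in the answer (assumed format)
example = {
    "Content-Disposition": 'attachment; filename="653534%20Panda%20Eyes%20-%20ILY.osz";'
}
print(filename_from_headers(example))
```

With requests, this could be called as filename_from_headers(dl.headers), since the headers object behaves like a mapping with a get() method.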

edited Jul 20, 2019 at 23:04

answered Jul 20, 2019 at 19:48

May.D

17 Comments

xupaii Over a year ago

How could I use requests.head()? I don't quite understand.

May.D Over a year ago

See edited answer. You just need to improve parsing, mine is a bit dirty.

xupaii Over a year ago

I've done: dl = requests.head(f"https://bloodcat.com/osu/s/{num1.rstrip()}"), then x = dl.headers.get("Content-Disposition"), and then y = x.split("filename=\"")[1].split(".osz\";")[0].replace("%20", " "), followed by return y.

May.D Over a year ago

I would just add a check on the response code/content, or on x, to be sure the Content-Disposition header is present, and you should be good to go.

xupaii Over a year ago

I would also make sure all other percent-encoding is gone too, haha!
