I'd like a way to generate a filename for osu! beatmaps which I'm downloading. Ideally I would go through the HTML looking for a certain phrase, beatmapsets?q=, and get the word(s) that come after the q=.

I've tried using lxml.html, but I have little experience with it, and the code below returns an empty list.

import requests
from lxml.html import fromstring

class OsuMaps:
    def generateFileName(self, num1=None):
        if not num1:
            print("Missing required argument: 'num1'")
            return
        dl = requests.get(f"https://bloodcat.com/osu/s/{num1.rstrip()}")

        # ..generate FinalName

        tree = fromstring(dl.content)
        # XPath contains() takes two comma-separated arguments
        FinalName = tree.xpath(
            "//a[contains(@href, 'beatmapsets?q=')]"
        )

        return FinalName

osu = OsuMaps()
osu.generateFileName("653534") # ideal outcome - "653534 Panda Eyes - ILY"

The ideal result is shown in the comment, but I don't know where to start. All I know is that the two keywords I need (the song name, ILY, and the artist, Panda Eyes) are in the HTML as:

<a class="beatmapset-header__details-text beatmapset-header__details-text--title u-ellipsis-overflow" href="/beatmapsets?q=ILY">ILY</a>

and

<a class="beatmapset-header__details-text beatmapset-header__details-text--artist" href="/beatmapsets?q=Panda%20Eyes">Panda Eyes</a>

I would also need to be able to reuse this code so that it gets the q=<text> value every time.

  • The URL https://bloodcat.com/osu/s/653534 prompts a download of the file 653534 Panda Eyes - ILY.osz; it is not HTML content (len(dl.text) --> 9123471), so the question is not relevant. Commented Jul 20, 2019 at 19:48
  • @RomanPerekhrest If it automatically downloads the file with the desired name, is there a way for me to keep the original filename and avoid the whole generateFileName() function? Commented Jul 20, 2019 at 21:57

1 Answer

According to the requests documentation, dl.content (the content attribute of the response returned by requests.get) holds the response body as raw bytes. What you need to parse is dl.text, the decoded string.
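
If the response really were an HTML page containing the two links from the question, the q= values could be pulled out roughly as follows. This is only a sketch: it uses the standard library's html.parser rather than lxml so that it has no third-party dependency, the sample markup is shortened from the question, and QueryLinkParser and extract_query_values are made-up names, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class QueryLinkParser(HTMLParser):
    """Collects the decoded q= value of every <a> whose href contains beatmapsets?q=."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "beatmapsets?q=" in href:
            # parse_qs also decodes percent-encoding such as %20
            query = parse_qs(urlparse(href).query)
            self.values.extend(query.get("q", []))


def extract_query_values(html_text):
    """Hypothetical helper: return all q= values found in html_text, in document order."""
    parser = QueryLinkParser()
    parser.feed(html_text)
    return parser.values


# Shortened copies of the two anchors shown in the question
sample = (
    '<a class="beatmapset-header__details-text--title" href="/beatmapsets?q=ILY">ILY</a>'
    '<a class="beatmapset-header__details-text--artist" href="/beatmapsets?q=Panda%20Eyes">Panda Eyes</a>'
)
print(extract_query_values(sample))  # ['ILY', 'Panda Eyes']
```

Because the function takes any HTML string, it can be reused for other pages that carry the same kind of q= links.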

Also, as @RomanPerekhrest points out, the given link refers to a binary file, so parsing it with lxml won't make sense. However, you can use the requests.head() method to get the file name and extract the data you need.

Try something like below:

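# A HEAD request fetches only the headers, so the large .osz body is not downloaded
dl = requests.head(f"https://bloodcat.com/osu/s/{num1.rstrip()}")
# Take the quoted filename from the Content-Disposition header and turn %20 back into spaces
fname = dl.headers["Content-Disposition"].split('filename="')[-1].split('";')[0].replace("%20", " ")

# fname == '653534 Panda Eyes - ILY.osz'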
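
A slightly more defensive version of the same parsing could look like the sketch below. It works on the header string only, so it makes no network request. The header format in the example is an assumption inferred from the split() calls above, and filename_from_content_disposition is a hypothetical helper name.

```python
import re
from urllib.parse import unquote

# Matches filename="..." anywhere in the header value
_FILENAME_RE = re.compile(r'filename="([^"]*)"')


def filename_from_content_disposition(header_value):
    """Return the decoded filename from a Content-Disposition value, or None if absent."""
    if not header_value:
        return None
    match = _FILENAME_RE.search(header_value)
    if match is None:
        return None
    # unquote() decodes every percent-escape, not just %20
    return unquote(match.group(1))


# Assumed header shape, based on the parsing shown above
example_header = 'attachment; filename="653534%20Panda%20Eyes%20-%20ILY.osz";'
print(filename_from_content_disposition(example_header))  # 653534 Panda Eyes - ILY.osz
```

With requests, the input would be dl.headers.get("Content-Disposition"), which returns None when the header is missing, so the function returns None instead of raising a KeyError.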

17 Comments

How could I use requests.head()? I don't quite understand.
See edited answer. You just need to improve parsing, mine is a bit dirty.
I've done: dl = requests.head(f"https://bloodcat.com/osu/s/{num1.rstrip()}"), then x = dl.headers.get("Content-Disposition"), and then y = x.split("filename=\"")[1].split(".osz\";")[0].replace("%20", " "), followed by return y.
I would just add a check on the response code/content, or on x, to be sure the Content-Disposition header is present, and you should be good to go.
I would also make sure all other percent-encoding is decoded too, haha!