I'm currently attempting to scrape a wiki for some image files. I have determined that every image I want is hosted at a URL with the following structure:

https://static.wikia.nocookie.net/<game name>/images/X/XY/<file name.png>

In all cases I know the exact file name corresponding to the image I'm searching for. However, here's my issue: X is always a one-digit hexadecimal number e.g. 3, and XY is always a two-digit hexadecimal number whose first digit is the same as X, e.g. 3c. But as far as I can tell these numbers are completely arbitrary and there is no way to reliably predict them in advance for a specific image I want to retrieve.

My plan moving forward is to search through the entire web directory until I find the files I want, check the exact URL they are stored at, and write them to a local file for instantaneous subsequent lookup. To accomplish this, I see two options:

  1. For a given X and XY, I could somehow retrieve the entire directory at .../images/X/XY/, check what files are stored there, and write all of the URLs to a local file.
  2. For a given file name, I iterate through all possible combinations of X and XY until I find where the file is stored, and write its URL to a local file.

In total I have several thousand images I want to find the URLs for. Given that, option 1 would appear to be an astronomically faster approach, but I'm not sure if retrieving an entire directory of files from the web at once is possible. Can it be done with HTTPS requests (I'm using Node.js for reference)? If not, are there any other tools I could potentially use, or will I have to resort to option 2?
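To make the cost of option 2 concrete: with 16 possible values of X and 16 possible second digits, there are at most 16 × 16 = 256 candidate paths per file. A minimal sketch of enumerating them follows; the function name `candidateURLs` is illustrative, and the game and file name are just an example image on this host. It only builds the URLs and sends no requests.

```javascript
// Sketch only: lists every possible X/XY location for one file name.
function candidateURLs(gameName, fileName) {
    const digits = '0123456789abcdef'
    const urls = []
    for (const x of digits) {
        // XY always starts with the same digit as X, so only Y varies here
        for (const y of digits) {
            urls.push(`https://static.wikia.nocookie.net/${gameName}/images/${x}/${x}${y}/${fileName}`)
        }
    }
    return urls
}

const candidates = candidateURLs('callofduty', 'MW3_UAV_Recon.png')
console.log(candidates.length)
```

Each candidate would then need its own request, so the worst case is 256 requests per image.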

  • Option 1 is normally impossible unless the website provides a directory listing or some other sort of index/API to get the full URL or even the Y you need. Are you sure that Y is completely random? It could be derived from the file name for example. If it is random, then most likely option 2 is the only option. Commented May 9 at 21:15
  • "Can I do this on the web" does not really make sense. Every server you can connect to is free to make its own rules about what you can and cannot access. Commented May 9 at 21:17
  • Have you checked the terms of use of the site? Are you aware of copyright laws? Commented May 9 at 21:22
  • @MarcoBonelli I've been looking but I haven't been able to find any patterns. I was really hoping to avoid option 2 if possible, since it takes ~200-250 ms per request and therefore finding a file could take up to a minute to iterate through all possibilities. That's gonna add up over several thousand files. Commented May 9 at 21:39

1 Answer

In general, you cannot discover the URLs of files on a server unless the website provides a directory listing or some other API to retrieve them. However, if the URLs are derived from information you already know, such as the file name, you can compute them directly.

Before continuing, I should remind you that different websites have different terms of service, which may or may not allow scraping of content (i.e. what you are describing). Furthermore, the media you want to download may be subject to copyright or usage licenses. Make sure to comply with the ToS and with copyright/licensing before downloading and using it.

Moving on: I have done a quick test on images from static.wikia.nocookie.net, and it seems that the two directory names simply come from the MD5 hash of the file name: X is the first hexadecimal character of the hash, and XY is its first two characters.

Example:

# Full URL: https://static.wikia.nocookie.net/callofduty/images/e/e0/MW3_UAV_Recon.png
echo -n 'MW3_UAV_Recon.png' | md5sum
e05f5e0241b572f06a5246b5f201140b  -

Thus if you know the game name and the file name you have all the info you need:

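const crypto = require('crypto')

// Builds the image URL from the game name and the exact file name.
function calculateURL(gameName, fileName) {
    // Hex-encoded MD5 of the file name, e.g. "e05f5e02..."
    const hash = crypto.createHash('md5').update(fileName).digest('hex')
    // X is the first hex character of the hash, XY is the first two
    return `https://static.wikia.nocookie.net/${gameName}/images/${hash[0]}/${hash[0]}${hash[1]}/${fileName}`
}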
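Building on the function above, here is a minimal sketch of turning a whole list of file names into a lookup table that could be saved locally. The helpers `normalizeFileName` and `buildLookup` are hypothetical names of mine. The sketch assumes the site stores spaces in file names as underscores (the usual MediaWiki convention) and hashes the underscore form; I have only checked the single example above, so treat that as an assumption to verify against a few known images.

```javascript
const crypto = require('node:crypto')

// Assumption: spaces are stored as underscores, and the hash is taken
// over the underscore form of the name.
function normalizeFileName(fileName) {
    return fileName.trim().replace(/ /g, '_')
}

function calculateURL(gameName, fileName) {
    const name = normalizeFileName(fileName)
    const hash = crypto.createHash('md5').update(name, 'utf8').digest('hex')
    // Percent-encode the name for the URL only; the hash uses the raw name
    return `https://static.wikia.nocookie.net/${gameName}/images/${hash[0]}/${hash.slice(0, 2)}/${encodeURIComponent(name)}`
}

// Maps each requested file name to its computed URL, with no network access.
function buildLookup(gameName, fileNames) {
    const lookup = {}
    for (const fileName of fileNames) {
        lookup[fileName] = calculateURL(gameName, fileName)
    }
    return lookup
}

const lookup = buildLookup('callofduty', ['MW3_UAV_Recon.png', 'MW3 UAV Recon.png'])
console.log(JSON.stringify(lookup, null, 2))
```

Because nothing is requested from the server, computing several thousand URLs this way takes well under a second. The resulting object can be written to disk with `fs.writeFileSync` if you still want a local cache.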