
I would like to scrape the problems from these Go (board game) books, and convert them into SGFs, if they aren't in that format already. For now, I would be satisfied with only taking the problems themselves, no need for the answer variations, only the initial setup.

The link above might not work because some of the pages need you to be logged in. A standalone question link like this one seems to work without being logged in for now though, but that depends on that website's dev's whims.

That website is using a <canvas> component to draw the problems but I can't seem to find where the data is. I think they are not using SGF — SGF is a text format for encoding trees, it's the standard file type for Go — but their own coordinate system in a JSON. There's a var qqdata in one of the <script> tags at the end of the HTML file, but I'm not sure how to translate that into SGF coordinates.
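For reference, the kind of output I'm after is a single-node SGF holding only setup properties, something like this (stone positions made up):

```
(;GM[1]FF[4]SZ[19]AB[pd][qf][pg]AW[qd][qe]PL[B])
```

GM[1] marks the game as Go, SZ is the board size, AB/AW place the initial black and white stones as letter-pair coordinates (a through s on a 19×19 board), and PL records whose turn it is.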

This other project already extracts the data from these webpages (although I haven't yet been able to reproduce it), but I think it does things visually from the <canvas>?

I would prefer an answer in TypeScript, if possible, but Python would also be ok.

What would be the best way of scraping the data from that website?

  • The site I'm seeing is just PNG images, like view-source:static4.101weiqi.com/file/qimg1/18020.png. There are no canvases on the page or a qqdata variable. Can you clarify what you're looking for please? SGF isn't a common-knowledge format so maybe an example of what you're hoping to get would be helpful. Thanks! Commented Jul 26 at 3:11
  • Try this problem link. I think this one works without being logged in. Commented Jul 26 at 3:16
  • Ok, I've just added a note about it. Commented Jul 26 at 3:22
  • I don't have an answer, but if you can maybe reverse engineer parseSgfPtLst from the main .js file, you can potentially plug whatever the input data is into it. Scraping the canvas seems tough because you'd probably need to play a move to advance to the next board, and repeat until solved. Commented Jul 26 at 4:36
  • qqdata.c / qqdata.content hold base64 that decodes to a binary format that seems to include point data Commented Jul 26 at 5:01

2 Answers


The "best" way, as in the one that will probably work better than any other method, is to try to reverse engineer the page.

One of the first things we can notice on this page is that it contains an inline <script> tag that defines a few global variables, one of which looks a lot like a config object.

In there we can see that the content property, which is base64-encoded, holds what seems to be binary data. This is probably something we're interested in. Now we can check where this property is being read and try to decode it. One technique is to save a copy of the page locally, and then modify that object to convert the property into a getter function that holds a debugger; statement.
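The getter trick can be sketched like this. It's meant to be pasted into the DevTools console of the locally saved page, where qqdata already exists; the stand-in object below is only so the snippet runs on its own:

```javascript
// Stand-in for the page's global; on the real page, skip this line.
const qqdata = { content: "c29tZSBiYXNlNjQ=" };

// Replace the data property with an accessor that pauses execution
// the moment any script reads qqdata.content, revealing the decoding
// call in the stack.
const hidden = qqdata.content;
Object.defineProperty(qqdata, "content", {
  get() {
    debugger; // with DevTools open, execution stops here on first read
    return hidden;
  },
});
```

Reads still return the original value, so the page keeps working normally until something touches the property.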

Doing so, we can see that it's being read inside some test123 method. After deobfuscation, that method and its dependencies can be rewritten as

function parseGamePos(config) {
  // Content may already be a decoded array of positions.
  if (Array.isArray(config.content)) return config.content;
  // The XOR key is derived from config.ru, e.g. ru = 0 gives "101111".
  const i = config.ru + 1;
  const key = `101${i}${i}${i}`;
  // Strip whitespace, base64-decode, then XOR each byte with the repeating key.
  const bytes = atob(String(config.content).replace(/[\t\n\f\r ]+/g, ""));
  const output = [];
  let keyIndex = 0;
  for (let n = 0; n < bytes.length; ++n) {
    output.push(String.fromCharCode(bytes.charCodeAt(n) ^ key.charCodeAt(keyIndex)));
    keyIndex = (keyIndex + 1) % key.length;
  }
  // The decrypted text is plain JSON.
  return JSON.parse(output.join(""));
}

Now you can paste that function into your console and call it as parseGamePos(qqdata). This will output an array of 2 arrays containing the players' positions, each as a string of 2 letters corresponding to the xy position on the grid.

You can determine which player is to play by checking the boolean qqdata.blackfirst.

From there, I suppose you can rebuild the game.
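If you want to sanity-check the decoder outside the browser, you can round-trip a synthetic position through the same XOR-then-base64 scheme. The encodeGamePos helper below is hypothetical, written only to mirror the decoder, and parseGamePos is repeated so the snippet runs on its own (atob/btoa are global in browsers and Node 16+):

```javascript
// Copy of the deobfuscated decoder above, for a standalone demo.
function parseGamePos(config) {
  if (Array.isArray(config.content)) return config.content;
  const i = config.ru + 1;
  const key = `101${i}${i}${i}`;
  const bytes = atob(String(config.content).replace(/[\t\n\f\r ]+/g, ""));
  const output = [];
  let keyIndex = 0;
  for (let n = 0; n < bytes.length; ++n) {
    output.push(String.fromCharCode(bytes.charCodeAt(n) ^ key.charCodeAt(keyIndex)));
    keyIndex = (keyIndex + 1) % key.length;
  }
  return JSON.parse(output.join(""));
}

// Hypothetical inverse: JSON-encode, XOR with the same key, base64.
function encodeGamePos(positions, ru) {
  const i = ru + 1;
  const key = `101${i}${i}${i}`;
  const json = JSON.stringify(positions);
  let xored = "";
  for (let n = 0; n < json.length; ++n) {
    xored += String.fromCharCode(json.charCodeAt(n) ^ key.charCodeAt(n % key.length));
  }
  return btoa(xored);
}

const fake = { ru: 0, content: encodeGamePos([["pd", "dd"], ["qf"]], 0) };
console.log(parseGamePos(fake)); // → [ [ 'pd', 'dd' ], [ 'qf' ] ]
```

Since XOR is its own inverse, the decoder recovers the original arrays exactly.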


PS: note that the answers are sent in the "clear" inside qqdata.answers[n].pts.


3 Comments

Thank you so much! I'll test it out soon. As an aside, which JS package do you recommend for scraping multiple problems contained in books on that website?
In your deobfuscated example, did you mean bytes.length instead of i.length? I saw that in the obfuscated source, i held its own atob-style decoded content, while in your code i is a number (i = config.ru + 1). I changed it to bytes.length and it ran fine for me using the qqdata config from 101weiqi.com/qday/2025/7/30/1

I don't think you actually need to deobfuscate anything; I was able to extract the data later on through qqdata.content, like this:

function toSGFCoords(data) {
  const blackCoords = data.content[0]
  const whiteCoords = data.content[1]
  const sgfPrefix = ";GM[1]FF[4]CA[UTF-8]SZ[19]"

  const blackStones = blackCoords.map((coord) => `[${coord}]`).join("")
  const whiteStones = whiteCoords.map((coord) => `[${coord}]`).join("")

  const sgf = `(${sgfPrefix}C[${data.id} | ${data.levelname}]AB${blackStones}AW${whiteStones})`

  return sgf
}
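As a quick sanity check, calling it on a minimal hand-made object of the same shape (all field values invented; toSGFCoords condensed from above so the snippet runs on its own) produces a complete one-node SGF:

```javascript
// Condensed copy of toSGFCoords above, for a standalone demo
function toSGFCoords(data) {
  const blackStones = data.content[0].map((coord) => `[${coord}]`).join("")
  const whiteStones = data.content[1].map((coord) => `[${coord}]`).join("")
  return `(;GM[1]FF[4]CA[UTF-8]SZ[19]C[${data.id} | ${data.levelname}]AB${blackStones}AW${whiteStones})`
}

const sample = { id: 12345, levelname: "5K", content: [["pd", "qf"], ["qd"]] }
console.log(toSGFCoords(sample))
// → (;GM[1]FF[4]CA[UTF-8]SZ[19]C[12345 | 5K]AB[pd][qf]AW[qd])
```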

async function fetchProblem(problemId, path = "./problems/", filename = "") {
  // Go to the problem page (assumes a Puppeteer `page` and Node's `fs` are in scope)
  await page.goto(`https://www.101weiqi.com/q/${problemId}/`)

  // Getting the Data into an SGF
  const qqdata = await page.evaluate(() => window.qqdata)
  // console.log(qqdata)
  console.log(qqdata.id)
  const sgfString = toSGFCoords(qqdata)

  // Save the SGF to a File
  const name = filename === "" ? qqdata.id : filename
  const folder = qqdata.levelname.substring(0, 2)
  const fullPath = `${path}/${folder}`
  if (!fs.existsSync(fullPath)) fs.mkdirSync(fullPath, { recursive: true })
  const filePath = `${fullPath}/${name}.sgf`
  fs.writeFileSync(filePath, sgfString)
}

The real issue, in the end, is not getting blocked by 101weiqi. I think they've faced so many web scrapers at this point that they are getting more and more aggressive. And some problems I was trying to scrape from book pages seem to have weird data, such as URLs and metadata that differ.
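To reduce the chance of being rate-limited, one simple mitigation is to fetch problems strictly sequentially with a jittered pause between requests. This is a generic sketch (the delay values are guesses, and `task` stands for something like the fetchProblem function above):

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Run one scraping task at a time, waiting a randomized delay between
// requests so the traffic looks less like an automated burst
async function scrapeSequentially(ids, task, baseDelayMs = 2000) {
  for (const id of ids) {
    await task(id)
    // pause somewhere between baseDelayMs and 2 * baseDelayMs
    await sleep(baseDelayMs + Math.random() * baseDelayMs)
  }
}
```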

I've been working on a scraping script in this project.

