
I am scraping profile pages on Khan Academy. I use their internal API, which returns JSON, to do it.

Here is the profile link I would like to scrape: https://www.khanacademy.org/profile/Viruslala/

Here is its API link: https://www.khanacademy.org/api/internal/user/kaid_896965538702696832878421/profile/widgets?lang=en&_=190427-0731-8941ef3f07bd_1556382106890

My problem: most of the data shows up in the JSON returned by the API, but some specific data that I would like to scrape does not.

I searched for a different API link but didn't find the right one.

The first image shows the two kinds of data I would like to scrape: the blue one and the yellow one.

enter image description here

In the JSON, the blue data shows up but the yellow data does not.

enter image description here

My questions are: Why is the yellow data not showing up? How can I get it with their API?

2 Answers

The yellow data (profile info) can be extracted with a regex from the response text of the original profile URL.

The pattern r extracts a string which can be loaded with json to produce a dict containing all the info.

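import json
import re

import requests

# Fetch the profile page HTML; requests does not execute JavaScript.
res = requests.get('https://www.khanacademy.org/profile/Viruslala/')
# Capture everything between profileInitOptions": and ,"view" in the page source.
r = re.compile(r'profileInitOptions":(.*),"view"', re.DOTALL)
# Parse the first captured string as JSON.
data = json.loads(r.findall(res.text)[0])
profile_data = data['profileData']
print(profile_data)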

Notes:

The page loads the required content dynamically from a script tag when JavaScript runs on the page. Although JavaScript does not run with requests, the data is still present in the page source, so you can apply a regex pattern which grabs the JavaScript object housing the data of interest. You specify the pattern with:

r = re.compile(r'profileInitOptions":(.*),"view"', re.DOTALL)

then apply it to the response text, res.text, and extract the first returned match:

r.findall(res.text)[0]

In the case of this page, what is returned can be parsed with a json library:

json.loads(r.findall(res.text)[0])

The parsed string is now a dictionary called data, from which you can access info by key:

data['profileData']
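As a variation on the steps above, here is a minimal sketch that avoids relying on the trailing ,"view" anchor and does not raise when the marker is missing. It is not the method used in the answer: it locates the profileInitOptions": marker and lets json.JSONDecoder.raw_decode parse one JSON value from that position. The helper name extract_profile_options and the sample string are made up for illustration; the real page source is much larger and its layout may differ.

```python
import json
import re

# The marker text comes from the pattern in the answer; \s* tolerates spacing.
MARKER = re.compile(r'profileInitOptions"\s*:\s*')


def extract_profile_options(html):
    """Return the JSON value that follows profileInitOptions": or None."""
    match = MARKER.search(html)
    if match is None:
        return None  # marker not found, e.g. the page layout changed
    try:
        # raw_decode parses exactly one JSON value starting at the given
        # index and ignores whatever text comes after it.
        obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return obj


# Invented sample standing in for res.text.
sample = (
    '<script>{"profileInitOptions": {"profileData": '
    '{"username": "demo", "points": 10}}, "view": "x"}</script>'
)
options = extract_profile_options(sample)
```

With the real page you would pass res.text instead of sample and then read options['profileData'], checking for None first.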

Regex:

enter image description here


re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).
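To see the effect of the flag in isolation, here is a small sketch using an invented string that contains a newline inside the captured region. The marker start": is a stand-in, not something taken from the Khan Academy page.

```python
import re

# Invented text: the part we want to capture spans two lines.
text = 'start":{"a": 1,\n"b": 2},"view"'
pattern = r'start":(.*),"view"'

# Without the flag, '.' cannot cross the newline, so nothing matches.
without_flag = re.findall(pattern, text)

# With re.DOTALL, '.' also matches '\n' and the whole object is captured.
with_flag = re.findall(pattern, text, re.DOTALL)
```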


7 Comments

Here you are using https://www.khanacademy.org/profile/Viruslala/ and fetching it with requests.get. The thing is that I would like to avoid this link because it takes longer to request than the API link. I am going to scrape thousands of profiles. Do you have another solution?
Does it take much more time than the API? That is the only place I found the data, I'm afraid.
Once I scale the script, yes, it can take quite a lot of time (because https://www.khanacademy.org/profile/Viruslala/ comes with a lot of images and other useless data), which is why I try to use their API as much as possible. Do you know why the data is not showing up in the API?
Is there API documentation? I assume either wrong endpoint or not provided
How can I find the API documentation? Also, if I run your script with another profile that doesn't display the desired data (https://www.khanacademy.org/profile/zoecod/), the output is None 0 0 (see the edits). Do you have an idea how to get the desired data anyway?

What are you using to scrape the API? urllib usually gets what you need:

import json
import urllib.request
with urllib.request.urlopen("https://www.khanacademy.org/api/internal/user/kaid_896965538702696832878421/profile/widgets?lang=en&_=190427-0731-8941ef3f07bd_1556382106890") as url:
    data = json.loads(url.read().decode())

The response from that API link doesn't seem to contain any data for the userSummary, so there's nothing to scrape.
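One way to confirm whether a field is present anywhere in the parsed response is to walk the structure and collect every value stored under a given key. The sketch below is only an illustration: find_key is a hypothetical helper, and the sample payload is invented and does not reflect the actual shape of the widgets response.

```python
def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, since the key may also appear deeper down.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Invented payload standing in for the parsed API response.
sample = {
    "widgets": [{"name": "badges", "data": {"count": 3}}],
    "userSummary": None,
}

counts = list(find_key(sample, "count"))
summaries = list(find_key(sample, "userSummary"))
missing = list(find_key(sample, "bio"))
```

An empty list means the key does not occur at all, while a list containing None means the key is present but carries no data.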

2 Comments

I use requests: requests.get('https://www.khanacademy.org/api/internal/user/kaid_896965538702696832878421/profile/widgets?lang=en&_=190424-1429-bcf153233dc9_1556201931959'.format(kaid)).json()
Indeed there is nothing, and that's what I would like to understand. Why?
