3

I've been working on a project to reverse-engineer Twitter's app and scrape public posts through its unofficial API, using Python. (I want to create an "alternative" app: simply a localhost page that can search for a user and fetch that user's posts.)

I've been searching and reading everything related to REST, AJAX, and the python modules requests, requests-html, BeautifulSoup, and more.

I can see when looking at twitter on the devtools (for example on Marvel's profile page) that the only relevant requests being sent (by POST and GET) are the following: client_event.json and UserTweets?variables=... . I understood that these are the relevant messages being received by cleaning the network tab and recording only when I scroll down and load new tweets - these are the only messages that came up which aren't random videos (I cleaned the search using -video -init -csp_report -config -ondemand -like -pageview -recommendations -prefetch -jot -key_live_kn -svg -jpg -jpeg -png -ico -analytics -loader -sharedCore -Hebrew).

I am new to this field, so I am probably doing something wrong. I can see on UserTweets the response I'm looking for - a beautiful JSON with all the data I need - but I am unable, no matter how much I've been trying to, to access it.

I tried different modules and different headers, and I get nothing. I DON'T want to use Selenium since it's tiresome, and I know where the data I need is stored. The JSON I want

I've been trying to send a GET request to: https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D

by doing:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

# The session has to be created before it can be used
session = HTMLSession()
response = session.get('https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D')
response.html.render()
s = BeautifulSoup(response.html.html, 'lxml')

but I get back an HTML script that either says Chromium is unsupported, or just a static page without the javascript updating the DOM.

All help appreciated.

Thank you

P.S I've posted the same question on reverseengineering.stackexchange, just to be safe (overflow has more appropriate tags :-))

9
  • It might be refusing the connection because it knows you're trying to scrape it. Maybe try imitating the "User-Agent" header of a common browser like firefox? Commented Apr 17, 2021 at 10:58
  • 1
    I had the same problem with Twitter and Instagram. I ended up using the official free API that Twitter provides, and used Selenium for Instagram. Since some of the most popular Twitter scraping packages on GitHub aren't working anymore, I came to the conclusion that there is no clean way to do it. Commented Apr 17, 2021 at 11:04
  • Since everything in twitter (ie the DOM) is generated dynamically using Express/React it's probably quite hard to get xpath to work consistently. Did you examine if the rendering data from __INITIAL_STATE__ might be of use? Commented Apr 17, 2021 at 11:54
  • @IODEV Some xpaths are hard to find, but it's still possible. For example, you could search a class by the part of its name. Commented Apr 17, 2021 at 12:32
  • @CmdCoder858 I honestly don't know what that means. Commented Apr 17, 2021 at 12:45

4 Answers 4

2

Before you deep dive into the actual code, I would first start building the correct request to twitter. I would use a 3rd party tool focused on REST and APIs such as Postman to build and test the required request - and only then would write the actual code.

From your question it seems that you'll be using Twitter's open API, which means you only need to send an x-guest-token and a basic Bearer authorization in your request headers.

  • The Bearer is static - you can just browse to twitter and copy/paste it from the dev tools network monitor.
  • To get the x-guest-token you'll need something dynamic, because it expires. I would suggest sending a curl request to Twitter, parsing the token from the response, and putting it in your header before sending the request. You can see something very similar in: Python Downloading twitter video using python (without using twitter api).

After you have both of the above, build the required GET request in Postman and test if you get back the correct response. Only after you have everything working in Postman - write the same in Python, or any other language**

**You can use Postman snippets, which automatically generate the code needed in many programming languages.
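When the request moves from Postman into Python, the long `variables` query string does not have to be pasted by hand. Here is a minimal sketch, assuming the endpoint and flags stay exactly as they appear in the question's URL (the query id `vamMfA41UoKXUmppa9PhSw` is copied from there and may well change). The helper names `build_user_tweets_url` and `decode_variables` are made up for illustration.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit

# Endpoint copied from the question; the query id in the path is an
# assumption and may be rotated by Twitter at any time.
BASE_URL = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets"


def build_user_tweets_url(user_id, count=20, cursor=None):
    """Build the UserTweets URL from a plain dict (hypothetical helper)."""
    variables = {"userId": user_id, "count": count}
    if cursor is not None:
        variables["cursor"] = cursor
    # Flags taken verbatim from the URL in the question.
    variables.update({
        "withHighlightedLabel": True,
        "withTweetQuoteCount": True,
        "includePromotedContent": True,
        "withTweetResult": False,
        "withUserResults": False,
        "withVoice": False,
        "withNonLegacyCard": True,
    })
    # Compact JSON, then percent-encode every reserved character.
    encoded = quote(json.dumps(variables, separators=(",", ":")), safe="")
    return f"{BASE_URL}?variables={encoded}"


def decode_variables(url):
    """Inverse helper: turn the encoded query string back into a dict."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["variables"][0])


if __name__ == "__main__":
    url = build_user_tweets_url("15687962", cursor="HBaIgLLN+KGEryYAAA==")
    print(url)
    print(decode_variables(url))
```

Decoding a URL copied from the network tab with `decode_variables` is also a quick way to see which fields (such as `cursor`) change between requests while scrolling.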


0

@TripleS, example of how one may extract json data from __INITIAL_STATE__ and write it to text file.

import requests
import re
import json
from contextlib import suppress

# get page
result = requests.get('https://twitter.com/ThePSF')


# Extract json from "window.__INITIAL_STATE__={....};"
# (the dot is escaped so it only matches a literal ".")
json_string = re.search(r"window\.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)

# convert text string to structured json data
twitter_json = json.loads(json_string)

# Save structured json data to a text file that may help
# you to orient yourself and possibly pick some parts you
# are interested in (if there are any)
with open('twitter_json_data.txt', 'w') as outfile:
    outfile.write(json.dumps(twitter_json, indent=4, sort_keys=True))
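A slightly more defensive variant of the extraction step, as a sketch only: it assumes the page really embeds `window.__INITIAL_STATE__ = {...}` as valid JSON, which may not be true for every page. Instead of a non-greedy regex (which stops at the first `};`, even inside a string value), it lets `json.JSONDecoder.raw_decode` find where the object ends. The function name `extract_initial_state` is hypothetical.

```python
import json
import re

# Matches the assignment up to (but not including) the opening brace.
MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def extract_initial_state(html):
    """Return the embedded state as a dict, or None if it is absent or invalid."""
    match = MARKER.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (here: ";</script>...").
        state, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return state


if __name__ == "__main__":
    sample = '<script>window.__INITIAL_STATE__ = {"user": {"name": "a};b"}};</script>'
    print(extract_initial_state(sample))
```

Returning `None` instead of raising makes it easy to tell "the page has no such block" apart from a network error in the calling code.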

3 Comments

thanks for the reply! Might be a silly question, but what should I be looking for? from a profile I was trying to work on, I can see no useful data whatsoever. The main question is how to get the response that I can spot on the network tab under devtools, or get a JavaScript-rendered DOM of that profile.
Sorry, but I'm not sure, since I don't know Twitter very well. You have to decide for yourself whether the content is useful for your needs or not. Sometimes sites scramble the content to make it harder to scrape the rendering data.
Btw, have you tried the "XPath helper" add-on for Chrome? It's quite capable of picking an xpath that doesn't suck :-) Install, activate and hold down the shift key while pointing at an element to generate the xpath.
0

I've just tried the same, but with requests, not requests_html module. I could get all site contents, but I would not call it "beautiful".

Also, I am now blocked from accessing the site without logging in. Here is my small example. Use the official Twitter API instead.

I also think that I will probably be blocked after some tries of using this script. I've tried it only 2 times.

import requests
import bs4

def example():
    result = requests.get("https://twitter.com/childrightscnct")
    soup = bs4.BeautifulSoup(result.text, "lxml")
    print(soup)

if __name__ == '__main__':
    example()

To select any element with bs4, use

some_text = soup.select_one('locator').get_text()

I found one tool for scraping Twitter that has quite a lot of stars on GitHub: https://github.com/twintproject/twint. I did not try it myself, and I hope it is legal.

4 Comments

After my first try, I cannot access Twitter manually without logging in, but I can still access it via requests. Probably because I did not reach my "request limit". They even have limits for their own API: developer.twitter.com/en/docs/twitter-api/rate-limits (900 requests per 15 minutes).
Hi, thanks for the reply! I tried with requests too. I could NOT get all the contents, how do you see any tweets? I am trying to show proof of concept of doing it WITHOUT the official api (and your example doesn't use the official api, anyway).
That's why I have been trying to use the unofficial api. I can see the JSON, I cannot get it in any way I could think of
You probably will be blocked, because this is against the Twitter Terms of Service twitter.com/en/tos
0

What you're missing is the bearer and guest token needed to make your request. If I just hit your endpoint with curl and no headers I get no response. However, if I add headers for the bearer token and guest token then I get that json you're looking for:

curl 'https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D' -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA' -H 'x-guest-token: 1452696114205847552'

You can get the bearer token (which may not expire that often) and the guest token (which does expire, I think) like this:

  1. The html of the twitter link you go to links a file called main.some random numbers.js. Within that javascript file is the bearer token. You can recognize it because it is a long string starting with lots of A's.
  2. Take the bearer token and call https://api.twitter.com/1.1/guest/activate.json using the bearer token as an authorization header

curl 'https://api.twitter.com/1.1/guest/activate.json' -X POST -H 'authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA'

In python this looks like:

import requests
import json

url = "https://twitter.com/i/api/graphql/vamMfA41UoKXUmppa9PhSw/UserTweets?variables=%7B%22userId%22%3A%2215687962%22%2C%22count%22%3A20%2C%22cursor%22%3A%22HBaIgLLN%2BKGEryYAAA%3D%3D%22%2C%22withHighlightedLabel%22%3Atrue%2C%22withTweetQuoteCount%22%3Atrue%2C%22includePromotedContent%22%3Atrue%2C%22withTweetResult%22%3Afalse%2C%22withUserResults%22%3Afalse%2C%22withVoice%22%3Afalse%2C%22withNonLegacyCard%22%3Atrue%7D"
headers = {"authorization": "Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA", "x-guest-token": "1452696114205847552"}
resp = requests.get(url, headers=headers)
j = json.loads(resp.text)

And now that variable, j, holds your beautiful json. One warning: sometimes the response can be so big that it doesn't seem to fit into a single response. If this happens, you'll notice that resp.text isn't valid json, but just some portion of a big blob of json. To fix this, you'll need to adapt the request to use "stream=True" and stream out the whole response before you try to parse it as json.
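Since the hard-coded x-guest-token will expire, step 2 can also be done from Python right before the real request. This is a rough sketch using only the standard library. It assumes that activate.json answers with a JSON object containing a `guest_token` key, which is an assumption about an undocumented endpoint. The helper names are made up.

```python
import json
import urllib.request

ACTIVATE_URL = "https://api.twitter.com/1.1/guest/activate.json"


def parse_guest_token(body):
    """Pull the token out of the response body.

    Assumes a body shaped like {"guest_token": "..."} (not documented).
    """
    return str(json.loads(body)["guest_token"])


def fetch_guest_token(bearer):
    """POST to activate.json with the bearer token, as in the curl call."""
    request = urllib.request.Request(
        ACTIVATE_URL,
        data=b"",  # empty body, the request only needs the header
        method="POST",
        headers={"authorization": f"Bearer {bearer}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_guest_token(response.read().decode("utf-8"))


def build_headers(bearer, guest_token):
    """Headers for the UserTweets request, same two keys as above."""
    return {
        "authorization": f"Bearer {bearer}",
        "x-guest-token": guest_token,
    }
```

A call would look like `headers = build_headers(bearer, fetch_guest_token(bearer))`, with `bearer` being the long string from the main js file. The resulting dict can be passed to `requests.get(url, headers=headers)` as before.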
