Brief description

I'm trying to search posts on Instagram by a given hashtag using bs4.
The tag I'm looking for is <div class="v1Nh3">.

What I did

import requests
from bs4 import BeautifulSoup

target = "https://www.instagram.com/explore/tags/test"
html = requests.get(target)
soup = BeautifulSoup(html.content, "html.parser")
root = soup.find(id="react-root")
posts = soup.find_all("div", class_="v1Nh3")

However, if I print the variable posts I get an empty list. Printing root shows an unexpected result:

<div id="react-root">
    <span><svg height="50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7" viewbox="0 0 50 50" width="50"><path d="M25 1c-6.52 0-7.34.03-9.9.14-2.55.12-4.3.53-5.82 1.12a11.76 11.76 0 0 0-4.25 2.77 11.76 11.76 0 0 0-2.77 4.25c-.6 1.52-1 3.27-1.12 5.82C1.03 17.66 1 18.48 1 25c0 6.5.03 7.33.14 9.88.12 2.56.53 4.3 1.12 5.83a11.76 11.76 0 0 0 2.77 4.25 11.76 11.76 0 0 0 4.25 2.77c1.52.59 3.27 1 5.82 1.11 2.56.12 3.38.14 9.9.14 6.5 0 7.33-.02 9.88-.14 2.56-.12 4.3-.52 5.83-1.11a11.76 11.76 0 0 0 4.25-2.77 11.76 11.76 0 0 0 2.77-4.25c.59-1.53 1-3.27 1.11-5.83.12-2.55.14-3.37.14-9.89 0-6.51-.02-7.33-.14-9.89-.12-2.55-.52-4.3-1.11-5.82a11.76 11.76 0 0 0-2.77-4.25 11.76 11.76 0 0 0-4.25-2.77c-1.53-.6-3.27-1-5.83-1.12A170.2 170.2 0 0 0 25 1zm0 4.32c6.4 0 7.16.03 9.69.14 2.34.11 3.6.5 4.45.83 1.12.43 1.92.95 2.76 1.8a7.43 7.43 0 0 1 1.8 2.75c.32.85.72 2.12.82 4.46.12 2.53.14 3.29.14 9.7 0 6.4-.02 7.16-.14 9.69-.1 2.34-.5 3.6-.82 4.45a7.43 7.43 0 0 1-1.8 2.76 7.43 7.43 0 0 1-2.76 1.8c-.84.32-2.11.72-4.45.82-2.53.12-3.3.14-9.7.14-6.4 0-7.16-.02-9.7-.14-2.33-.1-3.6-.5-4.45-.82a7.43 7.43 0 0 1-2.76-1.8 7.43 7.43 0 0 1-1.8-2.76c-.32-.84-.71-2.11-.82-4.45a166.5 166.5 0 0 1-.14-9.7c0-6.4.03-7.16.14-9.7.11-2.33.5-3.6.83-4.45a7.43 7.43 0 0 1 1.8-2.76 7.43 7.43 0 0 1 2.75-1.8c.85-.32 2.12-.71 4.46-.82 2.53-.11 3.29-.14 9.7-.14zm0 7.35a12.32 12.32 0 1 0 0 24.64 12.32 12.32 0 0 0 0-24.64zM25 33a8 8 0 1 1 0-16 8 8 0 0 1 0 16zm15.68-20.8a2.88 2.88 0 1 0-5.76 0 2.88 2.88 0 0 0 5.76 0z"></path></svg></span>
</div>

I guess this behaviour has something to do with React, but I'm not sure. So my questions are:
- Why is this happening?
- Is it possible to accomplish this with bs4, and how?
- If it isn't possible, could I do it with Selenium or other tools?

  • You will have to use Selenium. What data are you trying to scrape? Commented Apr 25, 2020 at 0:34
  • @0m3r The posts of a given hashtag, to get their post URLs and afterwards check the post info (image and description). Commented Apr 25, 2020 at 0:36

1 Answer

The tag's post data is contained in a JSON object in the source code of the page, e.g.:

import requests, json, re

u = "https://www.instagram.com/explore/tags/test/"
html = requests.get(u).text

matches = re.findall(r"window\._sharedData = (\{.*:false\});</script>", html, re.IGNORECASE | re.MULTILINE)
if matches:
    test = json.loads(matches[0])
    # browse the json object at https://jsoneditoronline.org/#left=cloud.5931e80efee541f69a856daf31a96d1b

    for n in test['entry_data']['TagPage'][0]['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
        shortcode = n['node']['shortcode']
        display_url = n['node']['display_url']
        thumbnail_src = n['node']['thumbnail_src']
        is_video = n['node']['is_video']
        accessibility_caption = n['node']['accessibility_caption']
        taken_at_timestamp = n['node']['taken_at_timestamp']
        owner = n['node']['owner']['id']
        edge_liked_by = n['node']['edge_liked_by']['count']
        # ...

        print(shortcode, display_url, edge_liked_by, owner)

        if n['node']['edge_media_to_caption']['edges']:
            for tags in n['node']['edge_media_to_caption']['edges']:
                post_text = tags['node']['text']
                print(post_text)

B_YtTM0nAe3 https://scontent-iad3-1.cdninstagram.com/v/t51.2885-15/e35/94292033_2909314699138403_547642880292448809_n.jpg?_nc_ht=scontent-iad3-1.cdninstagram.com&_nc_cat=111&_nc_ohc=LPorzz52Fe8AX-80CFG&oh=7a0379f33ec4f42ad36506da32bb40ae&oe=5ECB8677 1 5912200464
#mcq #602 #commercenewsguruji #bestoftheday #instadaily #instalike #igdaily #igers  #instalove #instagood #instadaily #dailymcq #mcqswithanswers #assessment #assessmenttest #assessmentmcq #test #practice
...
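The loop above indexes every key directly, so it raises as soon as one field is missing or the page layout differs. Below is a minimal sketch of a more tolerant variant. The helper name iter_posts and the hand-built sample dict are hypothetical, the key names are taken from the code above, and the /p/<shortcode>/ post URL pattern is an assumption rather than something the page data confirms.

```python
def iter_posts(shared_data):
    """Yield one dict per post; yield nothing if the layout is unexpected."""
    try:
        edges = (shared_data["entry_data"]["TagPage"][0]["graphql"]["hashtag"]
                 ["edge_hashtag_to_media"]["edges"])
    except (KeyError, IndexError, TypeError):
        return  # the JSON does not have the shape assumed above
    for edge in edges:
        node = edge.get("node", {})
        caption_edges = node.get("edge_media_to_caption", {}).get("edges", [])
        shortcode = node.get("shortcode")
        yield {
            "shortcode": shortcode,
            # Assumption: post pages live under /p/<shortcode>/.
            "post_url": f"https://www.instagram.com/p/{shortcode}/" if shortcode else None,
            "likes": node.get("edge_liked_by", {}).get("count"),
            "captions": [c.get("node", {}).get("text", "") for c in caption_edges],
        }


# Hypothetical, heavily trimmed stand-in for the parsed _sharedData object.
sample = {
    "entry_data": {"TagPage": [{"graphql": {"hashtag": {"edge_hashtag_to_media": {"edges": [
        {"node": {
            "shortcode": "B_YtTM0nAe3",
            "edge_liked_by": {"count": 1},
            "edge_media_to_caption": {"edges": [{"node": {"text": "#test #practice"}}]},
        }},
    ]}}}}]}
}

for post in iter_posts(sample):
    print(post["post_url"], post["likes"], post["captions"])
```

With the real page you would pass the result of json.loads(matches[0]) instead of sample. Missing fields come back as None or an empty list rather than stopping the loop.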



2 Comments

Awesome job, could you briefly explain how r"window\._sharedData = (\{.*:false\});</script>" works?
Sure, it's a simple regex to grab the JSON object; please check this link for a detailed explanation.
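To make the regex a little more concrete, here is a rough sketch that runs the same pattern against a tiny, made-up script tag. The sample_html string is hypothetical and far shorter than the real page source, so treat it only as an illustration of what each part of the pattern matches.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the page source.
sample_html = (
    '<script type="text/javascript">'
    'window._sharedData = {"entry_data": {"TagPage": [{"name": "test"}]}, "is_dev":false};'
    '</script>'
)

# window\._sharedData =   literal text; the dot is escaped so it matches only "."
# (\{.*:false\})          capture group: from the first "{" greedily up to the
#                         last ":false}" that is followed by ";</script>"
# ;</script>              literal text that ends the assignment
pattern = r"window\._sharedData = (\{.*:false\});</script>"

match = re.search(pattern, sample_html)
shared_data = json.loads(match.group(1)) if match else None
print(shared_data)
```

Two caveats follow from the pattern itself: it only matches while the object ends with a key whose value is false, and "." does not match newlines unless re.DOTALL is passed, so the whole assignment has to sit on one line. re.MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern.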
