1

I'm currently trying to scrape info from this Reddit page. My goal is to have Excel open all the posts in new tabs and then scrape information from each of those pages, since the starting page doesn't have as much information.

I've been trying to figure this out for the last few hours, but I'm admittedly pretty confused about how to do it and unsure what to do next, so any pointers would be greatly appreciated!

Here is my current code. It works decently enough, but as I said, I'm not sure what to do next to open the links it finds one by one and scrape each page for data. Right now the links are scraped off that first page and added to my spreadsheet, but if possible I'd like to skip that step and scrape them all at once.

Thanks! :)

Sub GetData()

    Dim objIE As InternetExplorer
    Dim itemEle As Object
    Dim upvote As Integer, awards As Integer, animated As Integer
    Dim postdate As String, upvotepercent As String, oc As String, filetype As String, linkurl As String, myhtmldata As String, visiComments As String, totalComments As String, removedComments As String
    Dim y As Integer

    Set objIE = New InternetExplorer
    objIE.Visible = False

    objIE.navigate (ActiveCell.Value)
    Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop

    y = 1

    For Each itemEle In objIE.document.getElementsByClassName("flat-list buttons")
        visiComments = itemEle.getElementsByTagName("a")(0).innerText
        linkurl = itemEle.getElementsByTagName("a")(0).href
        Sheets("Sheet1").Range("A" & y).Value = visiComments
        Sheets("Sheet1").Range("B" & y).Value = linkurl
        y = y + 1
    Next

End Sub

3
  • @QHarr I'm basically trying to open each of the links (the hrefs) and then scrape a few html elements for each of them and output those to my spreadsheet. So the data to scrape would be say, for example the # of upvotes and the output would be a number. Commented May 4, 2020 at 19:16
  • The % Upvoted is the only additional info those pages have, yes, but it's pretty important for my project and I'm just trying to automate as much as possible. Commented May 4, 2020 at 20:24
  • Yep! Because the percentage is what's got me stuck, really. Commented May 4, 2020 at 20:36

1 Answer

2

You should be able to gather the URLs, visit each one in a loop, write the results from each visited page to an array, and then write the array to the sheet. Add the following after your existing line

Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop

Add:

Dim nodeList As Object, i As Long, urls(), results()

Note: Opening the posts in new tabs would, at best, only save time on the page loads, as VBA is single-threaded. To do that you would need to store a reference to each tab, or open them all first and then loop through the relevant open windows to do the scrape. My preference would be to keep everything in the same tab, to be honest.

'objIE is the InternetExplorer instance from the question's code
Set nodeList = objIE.document.querySelectorAll(".comments")
ReDim urls(0 To nodeList.Length - 1)
ReDim results(1 To nodeList.Length, 1 To 3)
'Store all urls in an array to later loop
For i = 0 To nodeList.Length - 1
    urls(i) = nodeList.item(i).href
Next

'Visit each stored url in the same tab and collect one row of results per post
For i = LBound(urls) To UBound(urls)
    objIE.Navigate2 urls(i)
    While objIE.Busy Or objIE.readyState <> 4: DoEvents: Wend
    'may need a pause here
    results(i + 1, 1) = objIE.document.querySelector("a.title").innerText 'title
    results(i + 1, 2) = objIE.document.querySelector(".number").innerText 'upvotes
    results(i + 1, 3) = objIE.document.querySelector(".word").NextSibling.nodeValue '%
Next
'Write the whole results array to the sheet in one go
ActiveSheet.Cells(1, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
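
For the "may need a pause here" step, one option is to wait until the element you are about to read actually exists, rather than pausing for a fixed time. The following is only a rough sketch and not part of the original answer: `WaitForElement` and its `timeoutSeconds` parameter are hypothetical names, and it assumes the page is loaded in the same Internet Explorer instance and that `querySelector` is available in the document mode IE uses for the page.

```vba
' Hypothetical helper: polls until the page has finished loading and
' cssSelector matches an element, or until timeoutSeconds have elapsed.
' Returns True if the element was found, False on timeout.
Public Function WaitForElement(ByVal ie As Object, ByVal cssSelector As String, _
                               Optional ByVal timeoutSeconds As Double = 10) As Boolean
    Dim startTime As Double, elapsed As Double
    Dim node As Object

    startTime = Timer
    Do
        DoEvents
        If Not ie.Busy And ie.readyState = 4 Then
            Set node = Nothing
            On Error Resume Next      ' the document can be briefly unavailable mid-navigation
            Set node = ie.document.querySelector(cssSelector)
            On Error GoTo 0
            If Not node Is Nothing Then
                WaitForElement = True
                Exit Function
            End If
        End If
        elapsed = Timer - startTime
        If elapsed < 0 Then elapsed = elapsed + 86400   ' Timer resets at midnight
    Loop While elapsed < timeoutSeconds
End Function
```

Inside the loop it could be used along the lines of `If WaitForElement(objIE, ".number") Then results(i + 1, 2) = objIE.document.querySelector(".number").innerText`, which also avoids a run-time error when a selector matches nothing on a given page.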

5 Comments

Does .NodeValue work similarly to how .next_sibling works in BeautifulSoup, @QHarr?
Sorry if I took time to reply, I'm trying to understand and not just copy ^^ For some reason it's scraping the title of the first post in the list just fine, along with the upvotes, but not the %. And then after the macro finishes I end up with the first post (and its upvotes) repeating over 25 rows instead of all the different posts. I can't figure out what's causing that.
I checked the HTML and there's another CSS class called "word" that's technically below the one I want, that might be what's causing issues with the % though that's probably not why it's not scraping the other posts.
That fixed the first problem, thanks! And yeah, it's writing out [object Text].
Weirdly enough, it's telling me that the "object doesn't support this property or method".
