

I have the following code to scrape data from a website, and it works without any problem.
My issue is that I now need to run the code across multiple pages, because the website I'm scraping paginates its results.
E.g. a single page shows 48 records, but in most cases a listing has 200+ records split across 3 or 4 pages.
My code:

Public Sub Roupa()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument                  '<== VBE > Tools > References > Microsoft HTML Object Library
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=100", False
        .send
        html.body.innerHTML = .responseText
    End With
    Set data = html.getElementsByClassName("w-product__content")
    For Each item In data
        r = r + 1: c = 1
        For Each div In item.getElementsByTagName("div")
            With ThisWorkbook.Worksheets("Roupa")
                .Cells(r, c) = div.innerText
            End With
            c = c + 1
        Next
    Next
    Sheets("Roupa").Range("A:A,C:C,F:F,G:G,H:H,I:I").EntireColumn.Delete
End Sub

UPDATE
I've tried adding For n = 1 To 2 before the With. It works, but I need to know the exact number of pages in advance, so that's not very helpful.

  • What if you just try changing per_page=100 in the URL to, say, per_page=100000? Commented Mar 1, 2019 at 15:59
  • I've already tried that. The page only ever loads 48 records; per_page=100 was already my attempt. Commented Mar 1, 2019 at 16:14
  • @QHarr already provided a nice approach to derive the number of pages from w-filters__element. Another approach (which I generally use) is to loop over increasing page numbers for as long as the next page number is found in the inner text of the pagination list element (in this case pagination text-center, or the div with class w-pagination-block). Commented Mar 1, 2019 at 19:28
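
A minimal sketch of the loop-until-done idea from the last comment, for the case where the page count is not known up front. Instead of inspecting the pagination element, this variation simply stops at the first page that returns no w-product__content items. That stopping rule is an assumption: it only works if the site returns an empty listing past the last page rather than repeating the final page. RoupaUntilEmpty, BASE_URL and MAX_PAGES are hypothetical names, and the cap of 50 is arbitrary.

```vba
Option Explicit

' Sketch only: requests pages 1, 2, 3, ... and stops at the first page
' that contains no product blocks (ASSUMPTION: the site serves an empty
' listing beyond the last page).
' Requires VBE > Tools > References > Microsoft HTML Object Library.
Public Sub RoupaUntilEmpty()
    Const BASE_URL As String = "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=48&page="
    Const MAX_PAGES As Long = 50          ' hypothetical safety cap against an endless loop
    Dim html As HTMLDocument, data As Object, item As Object, div As Object
    Dim pageNum As Long, r As Long, c As Long

    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        For pageNum = 1 To MAX_PAGES
            .Open "GET", BASE_URL & pageNum, False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            html.body.innerHTML = .responseText

            Set data = html.getElementsByClassName("w-product__content")
            If data.Length = 0 Then Exit For   ' no products on this page: stop

            ' Same write-out as the original code: one row per product,
            ' one column per nested div
            For Each item In data
                r = r + 1: c = 1
                For Each div In item.getElementsByTagName("div")
                    ThisWorkbook.Worksheets("Roupa").Cells(r, c).Value = div.innerText
                    c = c + 1
                Next div
            Next item
        Next pageNum
    End With
End Sub
```

The trade-off is one extra request (the empty page) in exchange for not having to parse the result count or the pagination markup.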

1 Answer


Work out how many pages there are by dividing the result count by the number of results per page and rounding up. Then loop over the pages, substituting the appropriate page number into the URL:

Option Explicit
Public Sub Roupa()
    Dim data As Object, i As Long, html As HTMLDocument, r As Long, c As Long, item As Object, div As Object
    Set html = New HTMLDocument                  '<== VBE > Tools > References > Microsoft HTML Object Library
    Const RESULTS_PER_PAGE As Long = 48
    Const START_URL As String = "https://www.worten.pt/grandes-eletrodomesticos/maquinas-de-roupa/maquinas-de-roupa-ver-todos-marca-BALAY-e-BOSCH-e-SIEMENS?per_page=" & RESULTS_PER_PAGE & "&page=1"

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", START_URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .send
        html.body.innerHTML = .responseText
        Dim numPages As Long, numResults As Long, arr() As String
        arr = Split(html.querySelector(".w-filters__element").innerText, Chr$(32))
        numResults = arr(UBound(arr))
        numPages = 1
        If numResults > RESULTS_PER_PAGE Then
            numPages = Application.RoundUp(numResults / RESULTS_PER_PAGE, 0)
        End If

        For i = 1 To numPages
            If i > 1 Then
                ' Match "&page=1" rather than "page=1" so the per_page parameter can never be hit by the replace
                .Open "GET", Replace$(START_URL, "&page=1", "&page=" & i), False
                .setRequestHeader "User-Agent", "Mozilla/5.0"
                .send
                html.body.innerHTML = .responseText
            End If
            Set data = html.getElementsByClassName("w-product__content")
            For Each item In data
                r = r + 1: c = 1
                For Each div In item.getElementsByTagName("div")
                    With ThisWorkbook.Worksheets("Roupa")
                        .Cells(r, c) = div.innerText
                    End With
                    c = c + 1
                Next
            Next
        Next
    End With
    Sheets("Roupa").Range("A:A,C:C,F:F,G:G,H:H,I:I").EntireColumn.Delete
End Sub
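
The rounding step above can also be done with integer arithmetic, which avoids relying on Application.RoundUp (an Excel worksheet function). This is a small sketch; PagesNeeded is a hypothetical helper name that is not part of the code above, and returning 1 for non-positive input is an assumption.

```vba
Option Explicit

' Hypothetical helper: number of pages needed to show numResults records
' at resultsPerPage records per page.
Public Function PagesNeeded(ByVal numResults As Long, ByVal resultsPerPage As Long) As Long
    If numResults <= 0 Or resultsPerPage <= 0 Then
        PagesNeeded = 1   ' ASSUMPTION: always fetch at least the first page
    Else
        ' Integer ceiling; "\" is VBA's integer-division operator
        PagesNeeded = (numResults + resultsPerPage - 1) \ resultsPerPage
    End If
End Function
```

For example, 200 results at 48 per page works out as (200 + 47) \ 48, i.e. 5 pages, while exactly 48 results still gives 1 page.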

Thinking about what @AhmedAu said: if the page has loaded properly, it looks like a good way to get the page count is simply to use:

numPages = html.querySelectorAll("[data-page]").Length
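
If you go the data-page route, it may be worth guarding against a listing short enough to render no pagination links at all. This is a hedged sketch, assuming each page link carries one data-page attribute and that any previous/next arrows do not; PageCountFromLinks is a hypothetical name.

```vba
Option Explicit

' Hypothetical helper: page count taken from the pagination links.
' ASSUMPTION: one [data-page] element per page, none when there is a single page.
' Requires VBE > Tools > References > Microsoft HTML Object Library.
Public Function PageCountFromLinks(ByVal html As HTMLDocument) As Long
    Dim n As Long
    n = html.querySelectorAll("[data-page]").Length
    If n = 0 Then n = 1   ' no pagination found: treat as a single page
    PageCountFromLinks = n
End Function
```

If the count looks off for a given listing, checking it against the result-count calculation is a quick sanity test.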

2 Comments

Damn, you're a genius!! Do you know of any good courses? I would love to learn more!!
