2

I would like to extract data from betexplorer.com. I want to extract two different pieces of data from the following URL:

https://www.betexplorer.com/soccer/s...eague-1/stats/

I would like to extract Matches Played and Matches Remaining I would like to extract Home Goals and Away Goals (per match)

I have the the code to do that and it is as follows:

Option Explicit

Sub GetSoccerStats()


'Set a reference (VBE > Tools > References) to the following libraries:
'   1) Microsoft XML, v6.0
'   2) Microsoft HTML Object Library

Dim xmlReq As New MSXML2.XMLHTTP60
Dim objDoc As New MSHTML.HTMLDocument
Dim objTable As MSHTML.htmlTable
Dim objTableRow As MSHTML.htmlTableRow
Dim strURL As String
Dim strResp As String
Dim strText As String
Dim rw As Long

strURL = "https://www.betexplorer.com/soccer/south-korea/k-league-1/stats/"

With xmlReq
    .Open "GET", strURL, False
    .send
    If .Status <> 200 Then
        MsgBox "Error " & .Status & ":  " & .statusText
        Exit Sub
    End If
    strResp = .responseText
End With

Worksheets.Add

objDoc.body.innerHTML = strResp

Set objTable = objDoc.getElementsByClassName("table-main leaguestats")(0)

If Not objTable Is Nothing Then
    rw = 1
    For Each objTableRow In objTable.Rows
        strText = objTableRow.Cells(0).innerText
        Select Case strText
            Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                Cells(rw, "a").Value = objTableRow.Cells(0).innerText
                Cells(rw, "b").Value = objTableRow.Cells(1).innerText
                Cells(rw, "c").Value = objTableRow.Cells(2).innerText
                rw = rw + 1
        End Select
    Next objTableRow
    Columns("a").AutoFit
End If

Set xmlReq = Nothing
Set objDoc = Nothing
Set objTable = Nothing
Set objTableRow = Nothing


End Sub

This code works however i want to take it a step further.

I actually want to run this macro for many different URL's on the same site. I have a worksheet already created that has a list of Football Leagues (in the rows), the columns hold the data.

You can find the file here : https://www.dropbox.com/s/77sol24sty75w5z/Avg%20Goals.xlsm?dl=0

This is a file where i will add leagues to the rows as i go. Is it possible to adapt the code that extracts the data so that it can populate the columns in my sheet? I do not need to pull in the names of the data (matches remaining, home goals, away goals etc) as this code does, i only need the figures. The extracted figures would have to populate the columns as per the sheet (so each row contains the data for each league. As you can see there are a few leagues so it would need to loop through each row and then use the corresponding URL for that row.

You will notice that there is a column that contains the word CURRENT. This is to indicate that it should use the URL in the Current URL column. If I change the value to LAST i would like it to use the URL in the Last URL column.

For each league it will be different if I use CURRENT or LAST.

Here is a picture of expected output:

expectedoutput

Any help is greatly appreciated.

4
  • A little mock up of expected output including with question would help perhaps showing first couple of output rows. That could be inserted as an image. Commented Apr 27, 2019 at 14:09
  • And what is the difference between current and last urls? Do you have examples of each? Commented Apr 27, 2019 at 14:19
  • We should be able to answer this without referring to an external file. That said: is your dropbox file public? I see 404 not found Commented Apr 27, 2019 at 14:20
  • 1
    I have updated the link and added a photo Commented Apr 27, 2019 at 15:07

2 Answers 2

2

Keeping in line with your code this will output the data for those items in columns M:T. I have a helper function, GetLinks, which generates an array of final urls to used based on the value in column K:

inputArray = GetLinks(inputArray)

This array is looped and xhr requests are issued for the information. All the results information is stored in an array, results, which is written out in one go to the sheet at the end.

I work with array throughout as you don't want to keep reading from the sheet; that is an expensive operation which slows your code. For the same reason, if <> 200 occurs, I print to the immediate window the message and the url so as to not slow the code. You effectively have a log then you can review at the end.

The retrieved results are written out from column M, but as the data is in array, you can easily write out to where ever you want; simply change the start cell for pasting from M4 to which ever top leftmost cell you want. Your existing columns do not have percentages in, so I felt safe to assume you expected the written out data to be in new columns (possibly even in a different sheet).

Option Explicit   
Public Sub GetSoccerStats()
    Dim xmlReq As New MSXML2.XMLHTTP60, response As String
    Dim objDoc As New MSHTML.HTMLDocument, text As String
    Dim lastRow As Long, dataSheet As Worksheet, inputArray(), i As Long

    Set dataSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")

    With dataSheet
        lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
    End With

    inputArray = dataSheet.Range("J4:L" & lastRow).Value
    inputArray = GetLinks(inputArray)

    Dim results(), r As Long, c As Long
    ReDim results(1 To UBound(inputArray, 1), 1 To 8)

    With xmlReq

        For i = LBound(inputArray, 1) To UBound(inputArray, 1)
            r = r + 1
            .Open "GET", inputArray(i, 4), False
            .send
            If .Status <> 200 Then
                Debug.Print inputArray(i, 4), vbTab, "Error " & .Status & ":  " & .statusText
            Else
                response = .responseText
                objDoc.body.innerHTML = response

                Dim objTable As MSHTML.HTMLTable, objTableRow As MSHTML.HTMLTableRow

                Set objTable = objDoc.getElementsByClassName("table-main leaguestats")(0)

                If Not objTable Is Nothing Then
                    c = 1
                    For Each objTableRow In objTable.Rows
                        text = objTableRow.Cells(0).innerText
                        Select Case text
                        Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                            results(r, c) = objTableRow.Cells(1).innerText
                            results(r, c + 1) = objTableRow.Cells(2).innerText
                            c = c + 2
                        End Select
                    Next objTableRow
                End If
            End If
            Set objTable = Nothing
        Next
    End With
    dataSheet.Range("M4").Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Public Function GetLinks(ByRef inputArray As Variant) As Variant
    Dim i As Long
    ReDim Preserve inputArray(1 To UBound(inputArray, 1), 1 To UBound(inputArray, 2) + 1)

    For i = LBound(inputArray, 1) To UBound(inputArray, 1)
        inputArray(i, 4) = IIf(inputArray(i, 1) = "CURRENT", inputArray(i, 2), inputArray(i, 3))
    Next
    GetLinks = inputArray
End Function

Layout of file:

enter image description here


Given large number of requests led to blocking here is IE version:

'VBE > Tools > References:
'1: Microsoft HTML Object library  2: Microsoft Internet Controls
Public Sub GetSoccerStats()
    Dim ie As Object, t As Date
    Dim objDoc As New MSHTML.HTMLDocument, text As String
    Dim lastRow As Long, dataSheet As Worksheet, inputArray(), i As Long

    Const MAX_WAIT_SEC As Long = 10

    Set dataSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")
    Set ie = CreateObject("InternetExplorer.Application")
    With dataSheet
        lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
    End With

    inputArray = dataSheet.Range("C4:E" & lastRow).Value
    inputArray = GetLinks(inputArray)

    Dim results(), r As Long, c As Long
    ReDim results(1 To UBound(inputArray, 1), 1 To 8)

    With ie
        .Visible = True
        For i = LBound(inputArray, 1) To UBound(inputArray, 1)
            r = r + 1
            .navigate2 inputArray(i, 4)

            While .Busy Or .readyState < 4: DoEvents: Wend

            Dim objTable As MSHTML.HTMLTable, objTableRow As MSHTML.HTMLTableRow
            t = timer
            Do
                DoEvents
                On Error Resume Next
                Set objTable = .document.getElementsByClassName("table-main leaguestats")(0)
                On Error GoTo 0
                If Timer - t > MAX_WAIT_SEC Then Exit Do
            Loop While objTable Is Nothing

            If Not objTable Is Nothing Then
                c = 1
                For Each objTableRow In objTable.Rows
                    text = objTableRow.Cells(0).innerText
                    Select Case text
                    Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                        results(r, c) = objTableRow.Cells(1).innerText
                        results(r, c + 1) = objTableRow.Cells(2).innerText
                        c = c + 2
                    End Select
                Next objTableRow
            End If
            Set objTable = Nothing
        Next
        .Quit
    End With
    dataSheet.Range("F4").Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
Sign up to request clarification or add additional context in comments.

16 Comments

Hi, this works well, so i have updated my list of leagues (there are now over 70) When i run the macro i get the following error : Run-time error '-2147024891 (80070005): Access is denied. Within the debug the error is on the .send of the xmlreq
The site is likely blocking you for too many requests too quickly. You may need to introduce waits during loop. Do you want to use same site and upload file with all leagues in for me?
I thought it might be a timing issue. Here is a link to the file dropbox.com/s/77sol24sty75w5z/Avg%20Goals.xlsm?dl=0
Agreed. If it works with a single/few requests but doesn't with lots it is likely you are being throttled/blocked
It's blocking based on IP. Even with changing IP it has another layer of protection it throws up. My guess is that this scraping is not allowed.
|
0

Maybe something like this might work:

Option Explicit

Private Sub GetSoccerStats()

    'Set a reference (VBE > Tools > References) to the following libraries:
    '   1) Microsoft XML, v6.0
    '   2) Microsoft HTML Object Library

    Dim sourceSheet As Worksheet
    Set sourceSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")

    Dim firstRowToFetchDataFor As Long
    firstRowToFetchDataFor = sourceSheet.Cells(sourceSheet.Rows.Count, "C").End(xlUp).Row + 1 ' Assumes a row needs pulling if the value in column C is blank.

    Dim lastRowToFetchDataFor As Long
    lastRowToFetchDataFor = sourceSheet.Cells(sourceSheet.Rows.Count, "B").End(xlUp).Row

    Dim xmlReq As MSXML2.XMLHTTP60
    Set xmlReq = New MSXML2.XMLHTTP60

    Dim htmlDoc As MSHTML.HTMLDocument
    Set htmlDoc = New MSHTML.HTMLDocument

    Dim rowIndex As Long
    For rowIndex = firstRowToFetchDataFor To lastRowToFetchDataFor

        Dim URL As String
        Select Case LCase$(sourceSheet.Cells(rowIndex, "J"))
            Case "current"
                URL = sourceSheet.Cells(rowIndex, "K")
            Case "last"
                URL = sourceSheet.Cells(rowIndex, "L")
            Case Else
                MsgBox "Expected 'current' or 'last', instead got '" & sourceSheet.Cells(rowIndex, "J") & "' in cell '" & sourceSheet.Cells(rowIndex, "J").Address(False, False) & "'.", vbCritical
                Application.Goto sourceSheet.Cells(rowIndex, "J")
                Exit Sub
        End Select

        With xmlReq
            .Open "GET", URL, False
            .send
            If .Status <> 200 Then
                MsgBox "Request returned HTTP " & .Status & ":" & vbNewLine & vbNewLine & .statusText, vbCritical
                Exit Sub
            End If
            htmlDoc.body.innerHTML = .responseText
        End With

        Dim htmlTableExtracted As MSHTML.HTMLTable
        On Error Resume Next
        Set htmlTableExtracted = htmlDoc.getElementsByClassName("table-main leaguestats")(0)
        On Error GoTo 0

        If Not (htmlTableExtracted Is Nothing) Then
            Dim tableRow As MSHTML.HTMLTableRow
            For Each tableRow In htmlTableExtracted.Rows
                Select Case LCase$(tableRow.Cells(0).innerText)
                    Case "matches played"
                        sourceSheet.Cells(rowIndex, "G") = tableRow.Cells(1).innerText
                    Case "matches remaining"
                        sourceSheet.Cells(rowIndex, "H") = tableRow.Cells(1).innerText
                    Case "home goals"
                        sourceSheet.Cells(rowIndex, "C") = tableRow.Cells(2).innerText
                    Case "away goals"
                        sourceSheet.Cells(rowIndex, "E") = tableRow.Cells(2).innerText
                End Select
            Next tableRow

            Set htmlTableExtracted = Nothing ' Prevent this iteration's result having effects on succeeding iterations
        End If
    Next rowIndex
End Sub

I might be wrong, but shouldn't column E contain "away goals"? I've assumed the "A" in "A SCR AVG" stands for "Away" (since "H" in "H SCR AVG" seems to stand for "Home"). So I write "Away goals" to column E, even though the screenshot suggests they should be written to column B (or maybe I'm not reading correctly).

1 Comment

For some reason i cant see your macro to run when i paste it to in to the VBA module

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.