Get HTML content with variable tags and extract innerText with VBA for Excel

Question

I would like to get just the number between "ca." and "m²" from the text row. How can I do this in VBA, so that I don't need additional string formulas in Excel?

A further problem is that the innerText in the HTML content is sometimes in a tr.td.p tag, other times only in a tr.td tag (without p), and sometimes in a tr.td.b tag; in that last case "Description" is replaced with "Appointment" in the corresponding td tag.

Is there VBA code to check and extract this with querySelectorAll? Something like:

myString = html.querySelectorAll("tr td").Item(x).innerText

If InStr(1, myString, "DESCRIPTION", vbTextCompare) > 0 Then
    'NEED VBA CODE: value must be the number in the innerText of td.p or td
ElseIf InStr(1, myString, "APPOINTMENT", vbTextCompare) > 0 Then
    'NEED VBA CODE: value must be the last word of the innerText in td.b
End If

These are the 3 different snippets for the same property of different items:

<tr>
<td valign="top" align="left">Description:</td>
<td valign="top" align="left">
<p>
textA textB textC ca. 140 m².
</p>
</td>
</tr>

<tr>
<td valign="top" align="left">Description:</td>
<td valign="top" align="left">
textA textB textC ca. 85 m².
</td>
</tr>

<tr>
<td valign="top" align="left">Appointment</td>
<td valign="top" align="left">
<b>
textA textB textC canceled!
</b>
</td>
</tr>

1. If you can identify the right tr tag, and the only text in it is the text in the td or p tag, it is enough to read the innerText of the tr tag. Other tags inside the tr tag are then ignored. 2. Use Split() on the innerText and take the second-to-last element. Then you have what you want. learn.microsoft.com/de-de/office/vba/language/reference/… 3. If you need more info, it's always the same: please post the URL in question.
– Zwenn, Commented Apr 18, 2021 at 11:43
@Zwenn: Thanks! How do I identify the tr tags without any id, name, etc.? They are all just parents of td tags and children of table.tbody, and their count is different every time.
– Jasco, Commented Apr 18, 2021 at 12:25
Only you know the whole HTML. My approach would be to get the right table tag and extract all of its tr tags. But I don't know whether there is a way to identify the right table tag. Please see point 3 of my comment.
– Zwenn, Commented Apr 18, 2021 at 12:45
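
The Split() idea from the comments could look roughly like the sketch below. It assumes the innerText of the row has already been read into a string and that the words are separated by ordinary spaces, tabs or line breaks (non-breaking spaces are not handled). NormalizeSpaces, WordFromEnd and DemoWordFromEnd are hypothetical names, not part of any library.

```vba
Option Explicit

' Hypothetical helper: collapse line breaks, tabs and repeated spaces.
Private Function NormalizeSpaces(ByVal rawText As String) As String
    Dim cleaned As String
    cleaned = Replace$(rawText, vbCr, " ")
    cleaned = Replace$(cleaned, vbLf, " ")
    cleaned = Replace$(cleaned, vbTab, " ")
    Do While InStr(cleaned, "  ") > 0
        cleaned = Replace$(cleaned, "  ", " ")
    Loop
    NormalizeSpaces = Trim$(cleaned)
End Function

' Hypothetical helper: fromEnd = 0 returns the last word,
' fromEnd = 1 the second-to-last word; empty string if there are too few words.
Public Function WordFromEnd(ByVal rawText As String, ByVal fromEnd As Long) As String
    Dim parts() As String
    Dim cleaned As String

    cleaned = NormalizeSpaces(rawText)
    If Len(cleaned) = 0 Or fromEnd < 0 Then Exit Function

    parts = Split(cleaned, " ")
    If UBound(parts) - fromEnd >= 0 Then WordFromEnd = parts(UBound(parts) - fromEnd)
End Function

Public Sub DemoWordFromEnd()
    ' Sample strings taken from the snippets in the question.
    Debug.Print WordFromEnd("Description: textA textB textC ca. 140 m².", 1)   ' prints 140
    Debug.Print WordFromEnd("Appointment textA textB textC canceled!", 0)      ' prints canceled!
End Sub
```

This only works while the number really is the second-to-last word; a description with trailing text after "m²" would need a different approach, such as a regex.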

QHarr · Accepted Answer · 2021-04-21 06:37:51Z

You could extract the links to the detail documents from the response to a POST request, then visit each of those links with Internet Explorer, making sure to supply the right Referer header, and then use a regex to grab the measurement.

TODO: The code really needs a refactor, as there is a lot going on in the main sub. Ideally, each sub/function should do roughly one thing.

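Option Explicit

' Requires references (VBE > Tools > References):
'   Microsoft HTML Object Library   (MSHTML)
'   Microsoft Internet Controls     (SHDocVw)
'   Microsoft Scripting Runtime     (Scripting.Dictionary)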

Public Sub GetDataZvgPort()
    Const URL = "https://www.zvg-portal.de/index.php?button=Suchen"
    Dim html As MSHTML.HTMLDocument, xhr As Object

    Set html = New MSHTML.HTMLDocument
    Set xhr = CreateObject("MSXML2.ServerXMLHTTP.6.0")

    With xhr
        .Open "POST", URL, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send "land_abk=ni&ger_name=Peine&order_by=2&ger_id=P2411"
        html.body.innerHTML = .responseText
    End With

    Dim table As MSHTML.HTMLTable, r As Long, c As Long, headers(), row As MSHTML.HTMLTableRow
    Dim results() As Variant, html2 As MSHTML.HTMLDocument

    headers = Array("Aktenzeichen", "Amtsgericht", "Objekt/Lage", "Verkehrswert in €", "Termin", "Pdf-Link", "Addit Info Link", "m²")

    ReDim results(1 To 100, 1 To UBound(headers) + 1)

    Set table = html.querySelector("table")
    Set html2 = New MSHTML.HTMLDocument

    Dim lastRow As Boolean

    For Each row In table.Rows
        lastRow = False
        Dim header As String

        html2.body.innerHTML = row.innerHTML
        header = Trim$(row.Children(0).innerText)

        If header = "Aktenzeichen" Then          'start of new block. Assumes all blocks have this
            r = r + 1
            Dim dict As Scripting.Dictionary: Set dict = GetBlankDictionary(headers)
            On Error Resume Next
            dict("Addit Info Link") = Replace$(html2.querySelector("a").href, "about:", "https://www.zvg-portal.de/")
            On Error GoTo 0
        End If

        If dict.Exists(header) Then dict(header) = Trim$(row.Children(1).innerText)

        If (header = vbNullString And html2.querySelectorAll("a").Length > 0) Then
            dict("Pdf-Link") = Replace$(html2.querySelector("a").href, "about:blank", "https://www.zvg-portal.de/index.php")
            lastRow = True
        ElseIf header = "Termin" Then
            If row.NextSibling.NodeType = 1 Then lastRow = True
        End If

        If lastRow Then
            populateArrayFromDict dict, results, r
        End If
    Next

    results = Application.Transpose(results)
    ReDim Preserve results(1 To UBound(headers) + 1, 1 To r)
    results = Application.Transpose(results)
    
    Dim re As Object
    
    Set re = CreateObject("VBScript.RegExp")
    
    With re
        .Global = False
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "\s([0-9.]+)\sm²"
    End With

    Dim ie As SHDocVw.InternetExplorer
    
    Set ie = New SHDocVw.InternetExplorer
    
    With ie
        .Visible = True
        
        For r = LBound(results, 1) To UBound(results, 1)
            
            If results(r, 7) <> vbNullString Then
                
                .Navigate2 results(r, 7), headers:="Referer: " & URL
                
                While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
 
                'On Error Resume Next
                results(r, 8) = re.Execute(.document.querySelector("#anzeige").innerHTML)(0).Submatches(0)
                'On Error GoTo 0
   
            End If
            
        Next
        
        .Quit
        
    End With
    
    With ActiveSheet
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With

End Sub

Public Sub populateArrayFromDict(ByVal dict As Scripting.Dictionary, ByRef results() As Variant, ByVal r As Long)
    Dim key As Variant, c As Long

    For Each key In dict.Keys
        c = c + 1
        results(r, c) = Replace$(dict(key), " (Detailansicht)", vbNullString)
    Next

End Sub

Public Function GetBlankDictionary(ByRef headers() As Variant) As Scripting.Dictionary
    Dim dict As Scripting.Dictionary, i As Long

    Set dict = New Scripting.Dictionary

    For i = LBound(headers) To UBound(headers)
        dict(headers(i)) = vbNullString
    Next

    Set GetBlankDictionary = dict
End Function

edited Apr 21, 2021 at 6:37

answered Apr 18, 2021 at 14:58

QHarr

10 Comments

Zwenn Over a year ago

Nice. A little more complex than I thought ;-)

QHarr Over a year ago

I still think there are easier ways. Python doesn't need browser automation for any of this. I can't quite pin down what extra info (except referer) that needs to go with a GET request to retrieve the information, in the later stages, without error.

Jasco Over a year ago

I was quite proud when I learned to add .item(x) to queryselectorall. How should I have come up with .pattern = "\s([0-9.]+)\sm²"??? It almost reads (and feels) like e=mc² lol
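
One way to read the pattern from the answer: \s matches a single whitespace character, ([0-9.]+) captures one or more digits or dots, and \sm² requires another whitespace character followed by "m²". The minimal sketch below applies it to the sample text from the question. ExtractArea and DemoExtractArea are hypothetical names, and ChrW$(178) is used for "²" only as an assumption to sidestep code page issues in the VBA editor.

```vba
Option Explicit

' Hypothetical helper: returns the first number that is followed by " m²",
' or an empty string when the pattern does not match.
Public Function ExtractArea(ByVal inputText As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = False          ' first match only
        .IgnoreCase = True
        .Pattern = "\s([0-9.]+)\sm" & ChrW$(178)   ' ChrW$(178) is the superscript two
    End With

    Set matches = re.Execute(inputText)
    If matches.Count > 0 Then ExtractArea = matches(0).SubMatches(0)
End Function

Public Sub DemoExtractArea()
    Debug.Print ExtractArea("textA textB textC ca. 140 m" & ChrW$(178) & ".")   ' prints 140
End Sub
```

Checking matches.Count before reading the submatch avoids the run-time error that the answer's one-liner would raise on a page without a measurement.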

QHarr Over a year ago

I don't think that is the reason as . is included in the character set. See regex101.com/r/WAJqAh/1 . Please provide the exact string that is failing via pastebin.com or instructions on how to reach that result on the website.

QHarr Over a year ago

Change the flag to .Global = True; then you will need to loop over the matches. Dim matches As Object, match As Object: Set matches = re.Execute(.......), then For Each match In matches. As you loop over the matches, store each extracted value in an array. At the end, join the array with "=" & Join(arr, "+"). You can Dim arr() at the same time as Dim matches, and after the Set matches line ReDim arr(1 To matches.Count). See stackoverflow.com/questions/22542834/…
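
The steps in this comment could be sketched as follows. BuildSumFormula and DemoBuildSumFormula are hypothetical names, and the sample text is made up for illustration. The result is a formula string such as "=140+85"; values that use a dot as a thousands separator would probably need cleaning before Excel can evaluate them.

```vba
Option Explicit

' Hypothetical helper: collects every number followed by " m²" and
' returns them as a sum formula; empty string when nothing matches.
Public Function BuildSumFormula(ByVal inputText As String) As String
    Dim re As Object, matches As Object, match As Object
    Dim arr() As String, i As Long

    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = True           ' return all matches, not just the first
        .IgnoreCase = True
        .Pattern = "\s([0-9.]+)\sm" & ChrW$(178)
    End With

    Set matches = re.Execute(inputText)
    If matches.Count = 0 Then Exit Function

    ReDim arr(1 To matches.Count)
    For Each match In matches
        i = i + 1
        arr(i) = match.SubMatches(0)
    Next match

    BuildSumFormula = "=" & Join(arr, "+")
End Function

Public Sub DemoBuildSumFormula()
    Dim sample As String
    sample = "house ca. 140 m" & ChrW$(178) & " and garage ca. 85 m" & ChrW$(178) & "."
    Debug.Print BuildSumFormula(sample)   ' prints =140+85
End Sub
```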
