VBA - dealing with JavaScript content in XMLHTTP GET request

Question

I would like to extract content from a web page. However, the response text I get contains JavaScript, which is not processed the way it would be in a page opened in a browser.

Can this method be used to get the HTML content, or is browser emulation the only option? Or are there other ways of retrieving this content?

Dim oXMLHTTP As New MSXML2.XMLHTTP
Dim htmlObj As New HTMLDocument

With oXMLHTTP
    .Open "GET", "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc", False
    .send

    If .ReadyState = 4 And .Status = 200 Then            
        htmlObj.body.innerHTML = .responseText
        'do things
    End If

End With

Response text:

<!DOCTYPE html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/ic/mtq599v/ca/45th-street-limited-partnership&amp;distil_RID=2115B138-A1BF-11E6-A957-C0595F6B962F&amp;distil_TID=20161103121454" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/ser-yrbwqfedrrwwvctvyavy.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#verxvaxcuczwcwecuxsx{display:none!important}</style></head>
<body>
<div id="distil_ident_block">&nbsp;</div>
</body>
</html>

Typically, when the server sends a response, what you get is exactly what it sent. You cannot request "HTML only" (unless the server happens to be configured to support that, which seems unlikely). The only way to deal with dynamic content is browser automation (Selenium or similar).
– Tim Williams, Commented Nov 3, 2016 at 20:01
The reason you are getting the script is that it is in the HTML file itself. You could use an HTML parser to remove the script tags after you have downloaded the content. See the following thread on how to parse the DOM: stackoverflow.com/a/28917205/1640090.
– vbguyny, Commented Nov 3, 2016 at 20:19
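The browser-automation route mentioned in the first comment could look roughly like the sketch below. It assumes Internet Explorer is installed and can be driven through COM; `GetRenderedHtml` and its timeout parameter are hypothetical names, not part of the original question. A site that blocks bots may still refuse an automated browser.

```vba
Option Explicit

' Sketch only: lets a real browser load the page (so its scripts run)
' and then returns the resulting HTML. Assumes IE automation is available.
Function GetRenderedHtml(strUrl As String, Optional lngTimeoutSecs As Long = 30) As String

    Dim objIE As Object
    Dim sngStart As Single

    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    objIE.navigate strUrl

    ' Wait until the browser reports the document as complete (readyState 4).
    ' Timer resets at midnight, which is acceptable for a sketch.
    sngStart = Timer
    Do While objIE.Busy Or objIE.readyState <> 4
        DoEvents
        If Timer - sngStart > lngTimeoutSecs Then Exit Do
    Loop

    ' Returns an empty string if no document was loaded in time
    If Not objIE.document Is Nothing Then
        GetRenderedHtml = objIE.document.documentElement.outerHTML
    End If

    objIE.Quit

End Function
```

Calling `Debug.Print GetRenderedHtml(strUrl)` would then print the HTML as the browser sees it after script execution, rather than the raw server response.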

Robin Mackenzie · Accepted Answer · 2016-11-05 02:04:26Z


No, because the JavaScript is part of the HTML itself, inside <script> tags. You will have to post-process the response to remove those tags yourself.

You can use a function to remove the <script> nodes from the DOM after you have received the HTML for the page:

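Function RemoveScriptTags(objHTML As HTMLDocument) As String

    Dim objScripts As Object
    Dim i As Long

    ' Collect the <script> elements, then remove them back to front so that
    ' deleting a node does not shift the items still to be visited.
    Set objScripts = objHTML.getElementsByTagName("script")
    For i = objScripts.Length - 1 To 0 Step -1
        objScripts.Item(i).parentNode.removeChild objScripts.Item(i)
    Next i

    ' Return the whole document (minus the scripts), not just the body
    RemoveScriptTags = objHTML.documentElement.outerHTML

End Function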

This can be included in your sample code like so:

Option Explicit

Sub Test()

    Dim objXMLHTTP As New MSXML2.XMLHTTP
    Dim objHTML As Object
    Dim strUrl As String
    Dim strHtmlNoScriptTags As String

    strUrl = "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc"

    With objXMLHTTP
        .Open "GET", strUrl, False
        .send

        If .ReadyState = 4 And .Status = 200 Then
            Set objHTML = CreateObject("htmlfile")
            objHTML.Open
            objHTML.write objXMLHTTP.responseText
            objHTML.Close

            'do things
            strHtmlNoScriptTags = RemoveScriptTags(objHTML)
            Debug.Print strHtmlNoScriptTags

            'update html document with script-less document
            Set objHTML = CreateObject("htmlfile")
            objHTML.Open
            objHTML.write strHtmlNoScriptTags
            objHTML.Close

            'you can now operate on DOM of objHTML

        End If

    End With

End Sub
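The Open/write/Close sequence appears twice in the sample above, so it could be folded into a small helper. The sketch below is one way to do that; `LoadHtml` and `DemoLoadHtml` are hypothetical names, and the HTML string is made-up sample input rather than anything returned by the site.

```vba
Option Explicit

' Sketch only: builds a late-bound HTML document from a string
Function LoadHtml(strHtml As String) As Object

    Dim objDoc As Object

    Set objDoc = CreateObject("htmlfile")
    objDoc.Open
    objDoc.write strHtml
    objDoc.Close

    Set LoadHtml = objDoc

End Function

Sub DemoLoadHtml()

    Dim objDoc As Object

    ' Made-up sample markup, used only to show DOM access
    Set objDoc = LoadHtml("<html><body><div id=""name"">sample text</div></body></html>")

    ' Read an element through the DOM once the document is loaded
    Debug.Print objDoc.getElementById("name").innerText

End Sub
```

With such a helper, the body of the `If` block in `Test` reduces to loading the response text, calling `RemoveScriptTags`, and loading the result again.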

edited Nov 5, 2016 at 2:04

answered Nov 4, 2016 at 7:00

Robin Mackenzie


2 Comments

Ryszard Jędraszyk Over a year ago

The response text displayed after using this function is <DIV id=distil_ident_block> </DIV>. It removed the JS tags, yes, but it doesn't in any way let me operate on the HTML document that this script generates when I use a browser.

Robin Mackenzie Over a year ago

Please check out my edit: I fixed the function's return value to be the entire document (minus the script tags) and showed how to put this text back into the HTMLDocument so you can work with the DOM objects.
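The response text in the question contains a meta refresh to /distil_r_blocked.html and a single distil_ident_block div, which suggests the server returned an anti-bot interstitial rather than the real page. If that is the case, stripping scripts cannot recover the content. A minimal check for that situation might look like the sketch below; `LooksLikeBlockPage` is a hypothetical name, and the two marker strings are taken from the response text in the question, so they are assumptions that may not hold for other sites.

```vba
Option Explicit

' Sketch only: returns True when the HTML contains either marker seen in the
' question's response text, which hints at a bot-block page
Function LooksLikeBlockPage(strHtml As String) As Boolean

    Dim strLower As String

    ' Compare in lower case so attribute casing does not matter
    strLower = LCase$(strHtml)

    LooksLikeBlockPage = (InStr(strLower, "distil_r_blocked") > 0) _
                      Or (InStr(strLower, "distil_ident_block") > 0)

End Function
```

Running this on `.responseText` before any parsing would make it clear whether there is any real page content to work with.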
