I would like to extract content from a webpage. However, the response text I get back contains JavaScript, so I cannot work with it the way I could with a page opened in a browser.

Can this method be used to get the HTML content, or is browser emulation the only option? Or are there other ways of retrieving this content?

Dim oXMLHTTP As New MSXML2.XMLHTTP
Dim htmlObj As New HTMLDocument

With oXMLHTTP
    .Open "GET", "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc", False
    .send

    If .ReadyState = 4 And .Status = 200 Then            
        htmlObj.body.innerHTML = .responseText
        'do things
    End If

End With

Response text:

<!DOCTYPE html>
<head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_blocked.html?Ref=/ic/mtq599v/ca/45th-street-limited-partnership&amp;distil_RID=2115B138-A1BF-11E6-A957-C0595F6B962F&amp;distil_TID=20161103121454" />
<script type="text/javascript">
    (function(window){
        try {
            if (typeof sessionStorage !== 'undefined'){
                sessionStorage.setItem('distil_referrer', document.referrer);
            }
        } catch (e){}
    })(window);
</script>
<script type="text/javascript" src="/ser-yrbwqfedrrwwvctvyavy.js" defer></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;visibility:hidden}#verxvaxcuczwcwecuxsx{display:none!important}</style></head>
<body>
<div id="distil_ident_block">&nbsp;</div>
</body>
</html>
  • [Typically] When the server sends the response, what you get is what it sends. You cannot request "HTML only" (unless the server is somehow configured to support this, which seems unlikely). The only way to deal with dynamic content is via browser automation/selenium/etc. Commented Nov 3, 2016 at 20:01
  • The reason why you are getting the script is because it is in the HTML file directly. You could use an HTML parser to remove the script tags after you have downloaded the content. You can refer to the following thread on how to parse the DOM (stackoverflow.com/a/28917205/1640090). Commented Nov 3, 2016 at 20:19

1 Answer

No, because the JavaScript is part of the HTML itself, inside <script> tags. You will have to post-process the response to remove those tags yourself.

You can use a function to remove the <script> nodes from the DOM after you have received the HTML for the page:

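Function RemoveScriptTags(objHTML As Object) As String

    Dim objScripts As Object
    Dim i As Long

    'getElementsByTagName returns a live collection, so walk it backwards:
    'removing a node shifts the index of every node that follows it
    Set objScripts = objHTML.getElementsByTagName("script")

    For i = objScripts.Length - 1 To 0 Step -1
        objScripts.Item(i).parentNode.removeChild objScripts.Item(i)
    Next i

    'return the whole document, not just the body
    RemoveScriptTags = objHTML.documentElement.outerHTML

End Function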

This can be included in your sample code like so:

Option Explicit

Sub Test()

    Dim objXMLHTTP As New MSXML2.XMLHTTP
    Dim objHTML As Object
    Dim strUrl As String
    Dim strHtmlNoScriptTags As String

    strUrl = "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc"

    With objXMLHTTP
        .Open "GET", strUrl, False
        .send

        If .ReadyState = 4 And .Status = 200 Then
            Set objHTML = CreateObject("htmlfile")
            objHTML.Open
            objHTML.write objXMLHTTP.responseText
            objHTML.Close

            'do things
            strHtmlNoScriptTags = RemoveScriptTags(objHTML)
            Debug.Print strHtmlNoScriptTags

            'update html document with script-less document
            Set objHTML = CreateObject("htmlfile")
            objHTML.Open
            objHTML.write strHtmlNoScriptTags
            objHTML.Close

            'you can now operate on DOM of objHTML

        End If

    End With

End Sub
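
The response text in the question also contains a meta refresh that points at /distil_r_blocked.html, which suggests the server returned a placeholder page rather than the real content. In that case there is little left once the scripts are removed. Below is a minimal sketch, assuming the late-bound htmlfile document from Test above, that returns the refresh target if the page has one. The name GetMetaRefreshTarget is hypothetical and not part of any library.

```vba
Function GetMetaRefreshTarget(objHTML As Object) As String

    Dim objMetas As Object
    Dim objMeta As Object
    Dim strContent As String
    Dim lngPos As Long
    Dim i As Long

    Set objMetas = objHTML.getElementsByTagName("meta")

    For i = 0 To objMetas.Length - 1
        Set objMeta = objMetas.Item(i)

        'look for <meta http-equiv="refresh" content="10; url=...">
        If LCase$(objMeta.httpEquiv & "") = "refresh" Then
            strContent = objMeta.Content & ""
            lngPos = InStr(1, strContent, "url=", vbTextCompare)

            If lngPos > 0 Then
                'everything after "url=" is the redirect target
                GetMetaRefreshTarget = Trim$(Mid$(strContent, lngPos + 4))
                Exit Function
            End If
        End If
    Next i

    'no refresh with a url found: the function returns an empty string

End Function
```

Calling GetMetaRefreshTarget(objHTML) right after the first objHTML.Close in Test and checking for a non-empty result is one way to tell a placeholder page from a real one before parsing any further.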

2 Comments

The response text displayed after using this function is <DIV id=distil_ident_block>&nbsp;</DIV>. It did remove the script tags, but it does not let me work with the HTML document that this script generates when the page is opened in a browser.
Please check out my edit. I fixed the function's return value to be the entire document (minus the script tags) and showed how to put this text back into the HTMLDocument so you can work with the DOM objects.
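
For the document as a browser would build it, the browser automation route mentioned in the comments on the question is the usual fallback. The following is only a rough sketch, assuming the legacy InternetExplorer.Application COM object is still available on the machine (it may not be on current Windows installs). The Sub name LoadWithBrowser is hypothetical, and the site may still serve its block page to an automated browser.

```vba
Sub LoadWithBrowser()

    Dim objIE As Object
    Dim objHTML As Object
    Dim strUrl As String

    strUrl = "http://www.manta.com/ic/mtqyfk0/ca/riverbend-holdings-inc"

    'assumption: Internet Explorer automation is installed and enabled
    Set objIE = CreateObject("InternetExplorer.Application")
    objIE.Visible = False
    objIE.Navigate strUrl

    'wait until the browser reports the page as fully loaded (READYSTATE_COMPLETE = 4)
    Do While objIE.Busy Or objIE.ReadyState <> 4
        DoEvents
    Loop

    'the document now reflects whatever the page scripts have rendered so far
    Set objHTML = objIE.Document
    Debug.Print objHTML.body.innerHTML

    objIE.Quit

End Sub
```

Scripts that keep modifying the page after load has completed would need an extra wait before reading the document; how long depends on the page.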
