When using VBA to parse HTML stored in a cell, some tags, like <section>, cause problems.

For example, if I have the following content in an Excel cell:

<div><section>hello</section></div>

And I then apply the following function:

Public Function mainclean(sourceText As String) As String
    Dim DOC As New HTMLDocument
    DOC.body.innerHTML = sourceText
    mainclean = DOC.body.innerHTML
End Function

What I get is the following:

<DIV>hello</SECTION></DIV>

The opening <section> tag is being stripped. Clearly, section is not being recognised as an HTML tag.

The same happens with non-html tags like <mycustomtag></mycustomtag>

Is there any workaround?

Thanks

  • Maybe because you are declaring sourceText as a String. You declared DOC as HTMLDocument but then you turn it into a string with DOC.body.innerHTML = sourceText. Just a guess though. Commented Jun 12, 2020 at 16:53
  • I don't think HTMLDocument implements the latest version of IE - you may find that recent/HTML5 tags are not supported. Commented Jun 12, 2020 at 17:15
  • If you want the inner text of the HTML, change innerHTML to innerText. Why is your HTML code in Excel cells? I've never heard of that being necessary. Commented Jun 12, 2020 at 17:23
  • @Zwenn - good catch - I missed that entirely... Commented Jun 12, 2020 at 17:28
  • @TimWilliams At first I had overlooked it too and written something completely different. String in, String out had distracted me. Commented Jun 12, 2020 at 17:33

1 Answer

When using HTMLDocument, the default documentMode is 5 (IE5 behaviour), which means the parser has problems with recent/HTML5 tags such as <section>.

If required, you can get around this by using CreateObject("htmlfile"), which creates the same type of object but seems to behave slightly differently.

Sub Tester()

    Dim testHTML As String
    testHTML = "<div><section>hello</section></div>"

    Debug.Print mainclean(testHTML)

    Debug.Print mainclean2(testHTML)

End Sub

Public Function mainclean2(sourceText As String) As String
    Dim DOC 'As New HTMLDocument
    Set DOC = CreateObject("htmlfile")
    Debug.Print TypeName(DOC) '>>HTMLDocument
    Debug.Print "htmlfile Default doc mode", DOC.documentMode  '>>5
    DOC.Open "text/html"
    'next line switches document mode to 8 but commenting it out
    '  still gives the "correct" output with docMode 5 (??)
    DOC.write "<head><meta http-equiv=""X-UA-Compatible"" content=""IE=Edge""></head>"
    DOC.write "<body>" & sourceText & "</body>"
    DOC.Close
    Debug.Print "Fixed doc mode", DOC.documentMode '>>8
    mainclean2 = DOC.body.innerHTML                '>>  <DIV><SECTION>hello</SECTION></DIV>
End Function

Public Function mainclean(sourceText As String) As String
    Dim DOC As New HTMLDocument
    Debug.Print TypeName(DOC)                       '>>HTMLDocument
    Debug.Print "HTMLDocument Default doc mode", _
                               DOC.documentMode     '>> 5
    DOC.body.innerHTML = sourceText
    mainclean = DOC.body.innerHTML                  '>> <DIV>hello</SECTION></DIV>
End Function
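
Since the question is about HTML held in worksheet cells, here is a minimal sketch of the same workaround condensed into a function that can be called from a cell. The name CleanHtmlLateBound is hypothetical (it is not part of the original code), and the sketch assumes that CreateObject("htmlfile") behaves on your machine as described above; the resulting document mode and the letter case of the returned tags may differ between IE/Windows versions.

```vba
' Hypothetical helper: a condensed, late-bound variant of mainclean2 above.
' Assumes the "htmlfile" ProgID is available (it ships with Windows/MSHTML).
Public Function CleanHtmlLateBound(ByVal sourceText As String) As String
    Dim doc As Object

    ' Late binding, so no reference to the Microsoft HTML Object Library is needed
    Set doc = CreateObject("htmlfile")

    doc.Open "text/html"
    ' Request the newest document mode available before any body content is written
    doc.write "<head><meta http-equiv=""X-UA-Compatible"" content=""IE=Edge""></head>"
    doc.write "<body>" & sourceText & "</body>"
    doc.Close

    ' Return the markup as re-serialised by the parser
    CleanHtmlLateBound = doc.body.innerHTML
End Function
```

Placed in a standard module, it could then be used directly in a worksheet formula such as =CleanHtmlLateBound(A1), where A1 holds the HTML text.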

Related: VBA doesn't read XMLHTTP request's response according to its tree structure
