<tr>
    <td>Tanks:<br /><i>Lost:<br />Destroyed:</i></td>
    <td>750<br /><i>6<br />18</i></td>
</tr>
<tr>
    <td>Tanks:<br /><i>Lost:<br />Destroyed:</i></td>
    <td>750<br /><i>6<br />18</i></td>
</tr>

I am trying to scrape data, from within VBA, from a website whose HTML is structured like this. The value I want is "750"; however, it can be anything from 0 to 1,000,000, so extracting a fixed number of characters won't work.

Can anyone give some insight into the best way to scrape this? Below is my code, which imports all of the text as is, but the logic to post-process and trim out the data of interest is proving very difficult, so I am looking for a nice, clean way to scrape just the 750 slot.

Set elems = IE.document.getElementsByTagName("tr")
    For Each e In elems

        If e.innerText Like "Tanks:*" Then
            MsgBox e.innerText 'show the row's text, not the element object itself
        End If

    Next e
  • If it's always the 3rd td, then use XPath: //tr/td[3]. If it can be any n-th child, then still XPath: //tr/td[contains(text(), 'Destroyed')]/following-sibling::td Commented Jan 5, 2015 at 19:48
  • I believe it will always be the third td. Could you please expand upon your answer? How to implement it is over my head at this point. Thank you. Commented Jan 5, 2015 at 19:49

1 Answer

Within the row (tr), the content you want always seems to be in the second td, and it is the first content before the line break <br />. The stable structure of your HTML seems to be:

<tr>
    <td>
    </td>

    <td> 'we look for the first stuff inside here, before the <br /> comes
    </td>
</tr>

So, starting from your code:

Set elems = IE.document.getElementsByTagName("tr")
For Each e In elems

If e.innerText Like "Tanks:*" Then 'finding the right <tr>

    'get full HTML inside the <tr></tr>
    fullHTML = e.innerHTML

    'first step: find the second <td>; we skip the first one, we know it is not the one we look for
    '(vbTextCompare because IE may return the tag names in upper case)
    lookFor = "<td>"
    firstPos = InStr(1, fullHTML, lookFor, vbTextCompare)
    startPos = InStr(firstPos + 1, fullHTML, lookFor, vbTextCompare)

    If firstPos > 0 And startPos > 0 Then
        'once found, we can take the string starting from your 750 until the end
        remainingHTML = Mid(fullHTML, startPos + Len(lookFor))
        'so now we keep everything before the "<" of the line break tag
        endPos = InStr(remainingHTML, "<")
        If endPos > 0 Then
            myValue = Trim(Left(remainingHTML, endPos - 1))
            MsgBox myValue 'here is your 750, 1,000,000 or whatever else
        End If
    End If

End If

Next e
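
The value that comes out of the loop above is still text, so something like "1,000,000" cannot be used in arithmetic as is. A minimal sketch of a conversion step follows; ToCount is a hypothetical helper name (it is not part of the original answer), and it assumes that "," only ever appears as a thousands separator and that the value fits in a Long.

```vba
' Hypothetical helper, for illustration only.
' Assumes "," is used purely as a thousands separator.
Function ToCount(ByVal scraped As String) As Long
    Dim cleaned As String

    ' drop surrounding whitespace and the thousands separators
    cleaned = Replace(Trim(scraped), ",", "")

    If IsNumeric(cleaned) Then
        ToCount = CLng(cleaned)
    Else
        ToCount = -1 ' sentinel: the text could not be read as a number
    End If
End Function
```

Inside the loop you would then write something like tanks = ToCount(myValue) instead of showing the text in a message box.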

Please note that the parsing would be much easier if you worked with the DOM objects instead of the raw HTML string. The row element that IE hands back already exposes its child elements, so you could just take the list of children:

If e.innerText Like "Tanks:*" Then
    Set puppies = e.children 'object collection, so it needs Set
    'puppies holds the two <td> elements of the row, indexed from 0
End If

This way, you could directly read the second element of the collection. Note that the code is not tested and might need some debugging to work properly; it is only an idea of how you can structure your parsing.
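
To make the children idea a bit more concrete, here is a rough, untested sketch. It assumes IE is an InternetExplorer object that has already finished loading the page, that the row's text starts with "Tanks:" as in the question, and that the value is always the first thing inside the second td. ShowTankCount is just an illustrative name.

```vba
' Sketch only: relies on the assumptions stated above.
Sub ShowTankCount(ByVal IE As Object)
    Dim elems As Object, e As Object, cells As Object
    Dim cellHTML As String, myValue As String

    Set elems = IE.document.getElementsByTagName("tr")
    For Each e In elems
        If e.innerText Like "Tanks:*" Then
            Set cells = e.children 'the <td> elements of this row
            If cells.Length >= 2 Then
                'the collection is zero-based, so item 1 is the second <td>
                cellHTML = cells.Item(1).innerHTML
                'keep only the text before the first tag, i.e. before the <br>;
                'the appended "<" guards against an empty cell
                myValue = Trim(Split(cellHTML & "<", "<")(0))
                MsgBox myValue
            End If
        End If
    Next e
End Sub
```

Compared with scanning the row's innerHTML for the second td by hand, this leaves the job of locating the cell to the DOM and only uses string handling for the last step, cutting the text at the line break.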
