I have tried the following code to scrape a table from a local HTML file stored on my PC:

Sub Test()
Dim mtbl            As Object
Dim tableData       As Object
Dim tRow            As Object
Dim tcell           As Object
Dim trowNum         As Integer
Dim tcellNum        As Integer
Dim webpage         As New HTMLDocument
Dim fPath           As String
Dim strCnt          As String
Dim f               As Integer

fPath = Environ("USERPROFILE") & "\Desktop\LocalHTML.txt"
f = FreeFile()
Open fPath For Input As #f
strCnt = Input(LOF(f), f)
Close #f

webpage.body.innerHTML = strCnt

Set mtbl = webpage.getElementsByTagName("Table")(0)
Set tableData = mtbl.getElementsByTagName("tr")
Debug.Print tableData.Item(0).innerText

On Error GoTo TryAgain:
trowNum = 1

For Each tRow In tableData
    For Each tcell In tRow.Children
        tcellNum = tcellNum + 1
        Sheet1.Cells(trowNum, tcellNum) = tcell.innerText
    Next tcell
    trowNum = trowNum + 1
    tcellNum = 0
Next tRow
Exit Sub

TryAgain:
Application.Wait Now + TimeValue("00:00:02")
Err.Clear
Resume
End Sub

The code runs without errors, but the results are wrong in two ways. First, the Arabic characters appear on the worksheet as question marks, i.e. the Unicode characters are not read correctly. Second, the data is scattered across the sheet in an unorganized structure.

Here's the link to the local HTML file: http://www.mediafire.com/file/oxpyzv4gc53kuwg/LocalHTML.txt

Thanks in advance for any help.

1 Answer

So, maybe this will help a little. It is not the complete answer I would like to give. Basically, the HTML is a mess (in my opinion): the data is not laid out in rows (tr) containing table cells (td) in a way that lets you easily isolate individual text elements.

I am offering the following really only to demonstrate the oddities of trying to isolate individual text components, and to show reading and writing with the Arabic characters preserved. I borrowed an ADODB.Stream method from @whom to ensure the file is read as UTF-8.

This method, looping over table tags with hardcoded index numbers, is ugly and really belongs in the sin bin. It relies on the fact that the later tables in the document each hold one of your individual components, so stepping through them lets me reconstruct an overall table with rows and columns.

But you may get something from it:

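Option Explicit

' Requires references (Tools > References) to:
'   Microsoft HTML Object Library
'   Microsoft ActiveX Data Objects x.x Library
Public Sub test()
    Dim fStream As ADODB.Stream, html As HTMLDocument
    Set html = New HTMLDocument
    Set fStream = New ADODB.Stream

    ' Read the file as UTF-8 so the Arabic text is preserved
    With fStream
        .Charset = "UTF-8"
        .Open
        .LoadFromFile "C:\Users\User\Downloads\LocalHTML.html"
        html.body.innerHTML = .ReadText
        .Close
    End With

    Dim hTables As Object, startTableNumber As Long, endTableNumber As Long
    Dim i As Long, r As Long, c As Long
    Dim counter As Long, numColumns As Long

    ' Hardcoded for this particular file, found by inspecting its table tags
    startTableNumber = 43
    endTableNumber = 330
    numColumns = 9

    Set hTables = html.getElementsByTagName("table")
    r = 2: c = 1

    ' Every second table in this range holds one value
    For i = startTableNumber To endTableNumber Step 2
        counter = counter + 1
        ' After numColumns values, move to the start of the next row
        If counter > numColumns Then
            c = 1: r = r + 1: counter = 1
        End If
        ActiveSheet.Cells(r, c) = hTables(i).innerText
        c = c + 1
    Next i

End Sub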
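
The start and end numbers above came from inspecting the table tags by hand. For another file, one way to do that inspection is to list every table's index next to a short preview of its text and then look for where the individual values begin and end. The following is only a rough sketch under assumptions: the procedure name ListTableIndexes and the 40-character preview length are made up for illustration, the file path is the same one used above, and the output goes to the active sheet because the Immediate window typically does not display Arabic text.

```vba
' Hypothetical helper, not part of the original answer.
' Requires the same references as the code above:
'   Microsoft HTML Object Library
'   Microsoft ActiveX Data Objects x.x Library
Public Sub ListTableIndexes()
    Dim fStream As ADODB.Stream, html As HTMLDocument
    Dim hTables As Object, i As Long, preview As String

    Set html = New HTMLDocument
    Set fStream = New ADODB.Stream

    ' Read the file as UTF-8, as in the code above
    With fStream
        .Charset = "UTF-8"
        .Open
        .LoadFromFile "C:\Users\User\Downloads\LocalHTML.html"
        html.body.innerHTML = .ReadText
        .Close
    End With

    Set hTables = html.getElementsByTagName("table")

    ' Column A: zero-based table index, column B: start of that table's text
    For i = 0 To hTables.Length - 1
        preview = Left$(hTables(i).innerText, 40)
        preview = Replace(Replace(preview, vbCr, " "), vbLf, " ")
        ActiveSheet.Cells(i + 1, 1) = i
        ActiveSheet.Cells(i + 1, 2) = preview
    Next i
End Sub
```

Run it on a blank sheet, scan column B for the first and last tables that contain a single value, and try those indexes as startTableNumber and endTableNumber. Whether the step of 2 and the nine columns carry over to another file is something you would need to check against that file.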
3 Comments

Thank you very much. You are a legend. As for the hard-coded numbers 43 and 330: what do they refer to, and are they fixed? I have other similar files.
They came from inspecting the HTML table tags. If all school reports follow the same structure then there is a chance the numbers will hold but, to be honest, the method above is pretty specific and rigid, so I don't hold much hope for it simply being a case of copy, paste and use again.
Thanks a lot. The pages differ in their number of rows; I tested the code on another page and it throws an error. I just need to know how to determine the start row and end row as you did in the code (43, 330). How can I get those numbers?
