2

So, I have this spreadsheet with a list of about 5000 URLs. (All pages on our corporate intranet.)

We know some of the links are broken, but don't know of a good way to determine which without clicking all 5000 links.

Normally this would be a simple matter: Create a web page with links to the 5000 pages, and then check the links with a tool like Xenu Link Sleuth.

But that won't work in this case because many of the links are being redirected, and the redirect code spoofs HTTP.response 200, which tricks Xenu into treating it as a valid URL.

However, there is some good news: The redirect script does not run from within Excel. If you click a bad link inside Excel, the redirect script does not execute and the HTTP response is reported back to Excel. I believe Excel should be able to identify the correct HTTP response code (404) - or at least whether the link was valid or not.

Which brings me to my question:

Is there a way using VBA to write a script that would click through every link and capture the result? The result captured could be in the form of the HTTP response code or anything else you think would be useful in finding the bad links in this list of 5000 pages. Ideally the result would be written to a cell in the spreadsheet adjacent to the link.

If anyone if familiar enough with VBA to suggest a solution to this problem, I would be eternally grateful!

2 Answers 2

4

Here is an example to check the status line from a list of URL with Excel:

Sub TestLinks()
  Dim source As Range, req As Object, url$
  Set req = CreateObject("Msxml2.ServerXMLHTTP.6.0")

  ' define were the links and results are
  Set source = Range("A1:B2")

  ' clear the results
  source.Columns(2).Clear

  ' iterate each row
  For i = 1 To source.Rows.count
    ' get the link from the first column
    url = source.Cells(i, 1)

    ' send the request using a HEAD to check the status line
    req.Open "HEAD", url, False
    req.setRequestHeader "Accept", "image/webp,image/*,*/*;q=0.8"
    req.setRequestHeader "Accept-Language", "en-GB,en-US;q=0.8,en;q=0.6"
    req.setRequestHeader "Accept-Encoding", "gzip, deflate"
    req.setRequestHeader "Cache-Control", "no-cache"
    req.setRequestHeader "Content-Type", "text/xml; charset=utf-8"
    req.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
    req.Send

    ' write the result in the second column
    source.Cells(i, 2) = req.Status
  Next

  MsgBox "Finished!"
End Sub
Sign up to request clarification or add additional context in comments.

2 Comments

Using Xmlhttp is possible, too. For me, I found using WinHttpRequest to be a bit more robust.
Thanks very much, @florentbr! I'm still learning my way around the Visual Basic window in Excel. I'm going to try to get @bioschaf's script working first and then will try to get yours to work. I do appreciate the help!
3

Use a user defined function to return HTML-Status Codes and drag it down next to the links. Might take a while for Excel to check 5000 links, though.

Public Function CheckURL(url As String) As String
Dim request As New WinHttpRequest
request.Open "GET", url
request.Send
CheckURL = request.Status
End Function

You will probably need to add a reference to "Microsoft WinHTTP Services" under "Extras" -> "References"

11 Comments

Hi bioschaf, thank you very much for the help! I've hardly ever touched Excel VBA. Would you mind giving me a little guidance? In my spreadsheet I have enabled the developer tab, and then opened the visual basic window. From there, I pasted your code into the code area. Then back in the spreadsheet, I attempted to execute the function using =CheckURL(A2). But this results in an error. Any idea what I'm doing wrong? (I did enable Microsoft WinHTTP Services" as suggested.)
Hi, you need to insert the function in a module to make it work. So in the VBA-windows, right click in your project, choose "Insert -> Module" and paste the code there. Then you should be able to use the function within excel-cells.
Awesome, thank you very much. That worked. I tested on 3 rows and got results - now I'm copying down 5000 rows and Excel is "busy." :-)
Yes, it will take a lot of time. Maybe it would be a good idea to test with some hyperlinks that should return 404 - just to make sure the macro produces the correct results.
Good thinking. I've been playing with it; the results are mixed so far. I tested with some invalid URLs but got 401 instead of 404. I tested with some redirected URLs as well, and these also returned 401. And most baffling, I tested a couple of VALID URLs and they also returned 401. But not everything is returning 401. There are a handful of 200's, a few 403's and 404's, some 503's, and some that show #VALUE! All of the #VALUE!'s correspond to pages on sites that no longer exist.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.