I would like to build an array from an HTML file using PowerShell.

I am using a script which downloads the HTML file for Mozilla Firefox Developer Edition (the index file) to a local folder. I would like to parse it to get the values of the option elements inside the select element whose id is id_country.

I have been advised to use XPath for this, but I can't figure out how to parse the file and build an array from the result. Maybe a regex could be a workaround.

The HTML file is here:

http://pastebin.com/b8cShFLA

I would like to get all the values of the option elements here:

<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
   <option value="as">American Samoa</option>
   <option value="ad">Andorra</option>

...

I am quite new to PowerShell, so I am not really aware of the different solutions I could use. I need something fairly fast, as it is part of a package installer.

Basically, the script checks whether there is an installer which matches the locale of the user's computer and, if not, defaults to English. That is why I need the values from that list: to check which locales are available for Firefox Developer Edition.

Regards, O

3 Answers

5

I don't see a code sample to fix, so I'll make one.

If it were a remote HTML page I would use Invoke-WebRequest, but that doesn't work too well with local files.

For local files I would recommend using HTML Agility Pack to parse the HTML, and then XPath to get the options you're looking for. For example:

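# Load the HTML Agility Pack assembly (adjust the path to wherever the DLL lives)
Add-Type -Path .\HTMLAgilityPack\HtmlAgilityPack.dll
$url = (Get-Item .\b8cShFLA.html).FullName

$doc = New-Object HtmlAgilityPack.HtmlDocument
# -Raw (PowerShell 3.0+) reads the file as a single string instead of an array of lines
$doc.LoadHtml((Get-Content $url -Raw))

# Create hashtable to store data in
$langs = @{}

$doc.DocumentNode.SelectSingleNode("//select[@name='country']").SelectNodes("option") | ForEach-Object {
    $short = $_.GetAttributeValue('value', '')
    # Some HtmlAgilityPack versions treat <option> as an empty element,
    # in which case the label sits in the text node that follows it
    $long = $_.InnerText
    if ([string]::IsNullOrWhiteSpace($long)) { $long = $_.NextSibling.InnerText }

    # Store data in hashtable
    $langs[$short] = "$long".Trim()
}

$langs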

Output:

Name                           Value
----                           -----
rw                             Rwanda
tv                             Tuvalu
to                             Tonga
pn                             Pitcairn
bh                             Bahrain
lc                             Saint Lucia   
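
To tie this back to the installer, here is a minimal sketch of the lookup with a fallback to English. The function name Resolve-InstallerCode and the three-entry sample table are hypothetical, and it assumes (as the question does) that the codes on the page line up with the installer locales, which may not hold.

```powershell
# Hypothetical subset of the table built above, so the sketch runs on its own
$langs = @{ af = 'Afghanistan'; al = 'Albania'; dz = 'Algeria' }

# Hypothetical helper: return the wanted code if it is in the table, else the default
function Resolve-InstallerCode {
    param(
        [hashtable]$Available,
        [string]$Wanted,
        [string]$Default = 'en'
    )
    if ($Wanted -and $Available.ContainsKey($Wanted.ToLowerInvariant())) {
        return $Wanted.ToLowerInvariant()
    }
    return $Default
}

# Look up the two-letter code of the current culture, falling back to 'en'
$code = Resolve-InstallerCode -Available $langs -Wanted (Get-Culture).TwoLetterISOLanguageName
$code
```

A hashtable lookup is constant time, so the check itself should add no noticeable cost to the installer.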

5

If you're running PowerShell 3.0 or above, you can use Invoke-WebRequest for pages that are hosted on the web. Against a local file, it can be a bit finicky.

Invoke-WebRequest returns an HtmlWebResponseObject with a property called ParsedHtml. This object has a method named getElementById, which we can use because we know the select tag has the id "id_country". From there, it is a simple matter to iterate over the option tags and keep only the properties we want: "text" and "value".

The example below outputs a custom object containing the country name and the country code:

Code:

# I'm using your raw pastebin endpoint for this example
# Note: ParsedHtml is only populated in Windows PowerShell (5.1 and earlier)
$result = Invoke-WebRequest "http://pastebin.com/raw.php?i=b8cShFLA"

# Only return specific properties from the elements you're looking for
$countries = $result.ParsedHtml.getElementById("id_country") |
    Where-Object tagName -eq "option" |
    Select-Object -Property Text, Value

# Country name and code are stored in this variable
$countries

Output:

text                                                        value
----                                                        -----
Afghanistan                                                 af
Albania                                                     al
Algeria                                                     dz
American Samoa                                              as
Andorra                                                     ad
...                                                         ...

You can then use the country name and code as you would any other property on PowerShell objects.

As for the web endpoint, it sounds like you could modify this script to point to the original Mozilla page you're extracting this HTML from?
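
If ParsedHtml is not available (it relies on Internet Explorer's engine and is absent in PowerShell 6 and later), the regex workaround mentioned in the question is one possible fallback on the raw markup. This is only a sketch under assumptions: the inline sample stands in for $result.Content, and the patterns expect double-quoted attributes and option labels with no nested tags. Regexes are brittle against arbitrary HTML, so treat it as a last resort.

```powershell
# Inline sample standing in for the downloaded markup (e.g. $result.Content)
$html = @'
<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
</select>
<select id="other"><option value="x">Ignore me</option></select>
'@

# Isolate the target <select> first so options from other lists are not picked up
$selectPattern = '(?s)<select[^>]*\bid="id_country"[^>]*>(.*?)</select>'
$selectBody = [regex]::Match($html, $selectPattern).Groups[1].Value

# Capture the value attribute and the label of each option
$optionPattern = '<option[^>]*\bvalue="([^"]*)"[^>]*>([^<]*)'
$countries = @(foreach ($m in [regex]::Matches($selectBody, $optionPattern)) {
    [pscustomobject]@{
        Value = $m.Groups[1].Value
        # Decode entities such as &amp; in the label
        Text  = [System.Net.WebUtility]::HtmlDecode($m.Groups[2].Value).Trim()
    }
})

$countries
```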

2 Comments

What seems not very widely documented is this change from PowerShell 5.1 to PowerShell 7.x: file:// and ftp:// URI schemes are no longer supported. learn.microsoft.com/en-us/powershell/scripting/whats-new/… As it happens, I'm running PowerShell 5.1, so I still don't see why I can't parse a local file URI versus successfully parsing a hosted URI. The paucity of documentation and examples might lead me back to Python for this task. Just sharing the information for future strugglers.

For PowerShell 5.1 users, the $localUri string as provided here works in a browser, and returns a WebResponseObject using GetType(): $localUri = "file:///C:/Folder/File.hml"

0

For most HTML, another option is to load the file as XML and use it that way. See an example in my powershell tumbler file downloader:

https://github.com/jefflomax/powershell-download-tumbler-images
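
As a rough illustration of the idea (not code taken from that repository), here is a sketch that casts a fragment to [xml] and queries it with XPath. It only works when the markup is well-formed XML; the inline sample is an assumption standing in for the real file, which may well fail the cast.

```powershell
# Inline, well-formed sample standing in for the file contents
$html = @'
<select aria-required="true" id="id_country" name="country" required="required">
   <option value="af">Afghanistan</option>
   <option value="al">Albania</option>
   <option value="dz">Algeria</option>
</select>
'@

# The [xml] cast throws if the markup is not well-formed
$xml = [xml]$html

# Select every option under the select element with the known id
$options = $xml.SelectNodes("//select[@id='id_country']/option")

$countries = @(foreach ($o in $options) {
    [pscustomobject]@{
        Code = $o.GetAttribute('value')
        Name = $o.InnerText
    }
})

$countries
```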

1 Comment

This assumes the content is well-formed, which HTML is typically not.
