I am trying to parse an html page that contains these values:

<a href="somesite.html?id=123">...</a>
<a href="somesite.html?id=456">...</a>
<a href="somesite.html?id=789">...</a>
<a href="anothersite.html">...</a>

How would I parse the HTML string to get back an array that contains only the somesite.html links:

["somesite.html?id=123", "somesite.html?id=456", "somesite.html?id=789"]

Edited

Using Zhiguo Wang's base answer, I can't seem to get only the somesite.html id values... The 3rd item in the array contains excess characters:

let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
"<a href=\"somesite.html?id=456\">...</a>" +
"<a href=\"somesite.html?id=789\">...</a>" +
"<a href=\"anothersite.html\">...</a>\""
let seperateComponent = "<a href=\"somesite.html?id="

let linkExp = "[\\w\\W]*\">"

Returns this value:

["123", "456", "789\">...</a><a href=\"anothersite.html"]

Expected Value: ["123", "456", "789"]

...hmm. Changing linkExp to the below resolves it. What does \W represent in Regex?

let linkExp = "[\\w]*\">"

The length was also wrong. Casting to NSString let me grab the proper length.
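On the \W question: in the ICU regex syntax that NSRegularExpression uses, \w matches a word character (letters, digits, underscore) and \W matches anything that is not a word character, so [\w\W] matches every character, quotes and angle brackets included. Combined with the greedy *, that would explain the excess characters. The sketch below is written in current Swift rather than the Swift 1.2 of the post, and `firstMatch` is a hypothetical helper used only for illustration.

```swift
import Foundation

// Hypothetical helper: returns the first match of `pattern` in `text`, or nil.
func firstMatch(of pattern: String, in text: String) -> String? {
    guard let range = text.range(of: pattern, options: .regularExpression) else { return nil }
    return String(text[range])
}

// The third piece produced by splitting on `<a href="somesite.html?id=`.
let piece = "789\">...</a><a href=\"anothersite.html\">...</a>"

// [\w\W] accepts any character and * is greedy, so the match runs to the LAST `">`.
let greedy = firstMatch(of: "[\\w\\W]*\">", in: piece)

// \w cannot cross the quote, so the match ends at the FIRST `">`.
let wordOnly = firstMatch(of: "[\\w]*\">", in: piece)

// A lazy quantifier (*?) is another way to stop at the first `">`.
let lazy = firstMatch(of: "[\\w\\W]*?\">", in: piece)

print(greedy ?? "nil", wordOnly ?? "nil", lazy ?? "nil")
```

Note that [\w]* only works here because the ids are made of word characters; an id containing, say, a hyphen would need a pattern such as [^"]* instead.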

Edited 2

It looks like if this string comes before the somesite links, then "origin" is included in the array:

<meta name=\"referrer\" content=\"origin\">
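A possible explanation, assuming the split-then-match approach from the answer is used unchanged: the text before the first separator is also run through the regex, and [\w]*"> finds `origin">` inside the meta tag. The sketch below (current Swift, variable names chosen for illustration) reproduces that and shows one way to guard against it.

```swift
import Foundation

let html = "<meta name=\"referrer\" content=\"origin\">" +
    "<a href=\"somesite.html?id=123\">...</a>"
let separator = "<a href=\"somesite.html?id="

// The first piece is everything BEFORE the first separator, i.e. the meta tag.
let pieces = html.components(separatedBy: separator)
let firstPiece = pieces[0]

// The pattern from the question, applied to that leading piece.
let stray = firstPiece.range(of: "[\\w]*\">", options: .regularExpression)
    .map { String(firstPiece[$0]) }

// One possible guard: skip the first piece, since it never follows a separator.
let idPieces = Array(pieces.dropFirst())

print(stray ?? "nil", idPieces)
```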
  • @Wongzigii I feel like there's an easier solution than a 3rd party library. E.g all those a tags contain the same format of "somesite.html?id=". Can't regex do a find on those first characters up until the id=, then stop at the first double quotes? Idk how that would look though Commented Sep 22, 2015 at 4:18
  • <a href="(.*?)">.*?<\/a> Commented Sep 22, 2015 at 4:44
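The idea in these comments (match the fixed prefix up to id=, then stop at the first double quote) can be sketched with a capture group and no splitting step. This is only a sketch in current Swift, not the Swift 1.2 of the post; `captures(of:group:in:)` is a hypothetical helper, and the pattern assumes the anchors are written exactly as `<a href="somesite.html?id=...">`.

```swift
import Foundation

// Hypothetical helper: contents of capture group `group` for every match of `pattern`.
func captures(of pattern: String, group: Int, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return [] }
    // Build the NSRange from the String itself so it is measured in UTF-16 units.
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        guard let range = Range(match.range(at: group), in: text) else { return nil }
        return String(text[range])
    }
}

let html = """
<meta name="referrer" content="origin">
<a href="somesite.html?id=123">...</a>
<a href="somesite.html?id=456">...</a>
<a href="somesite.html?id=789">...</a>
<a href="anothersite.html">...</a>
"""

// Group 1 is the whole href; group 2 is everything after "id=" up to the first quote.
let pattern = "<a href=\"(somesite\\.html\\?id=([^\"]*))\""

let hrefs = captures(of: pattern, group: 1, in: html)
let ids = captures(of: pattern, group: 2, in: html)
print(hrefs)
print(ids)
```

Because the pattern names the somesite.html prefix explicitly, the meta tag and the anothersite.html link never match. A regex like this is still brittle against attribute reordering or single-quoted attributes.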

3 Answers

1

talk is cheap, show me the code

    let htmlString = "<a href=\"somesite.html?id=123\">...</a><a href=\"somesite.html?id=456\">...</a><a href=\"somesite.html?id=789\">...</a>"
    let seperateComponent = "<a href=\""

    let linkExp = "[\\w\\W]*\">"
    let linkRegExp = NSRegularExpression(pattern:linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
    let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
    var resultArray = [String]()

    if seperatedArray.count > 1 {
        for seperatedString in seperatedArray {
            if seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) > 3{
                let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options:NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, seperatedString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding)))
                if myRange.location != NSNotFound {
                    let matchString = (seperatedString as NSString).substringWithRange(myRange)
                    let linkString = (matchString as NSString).substringToIndex(matchString.lengthOfBytesUsingEncoding(NSUTF8StringEncoding) - 2)

                    resultArray.append(linkString)
                }
            }
        }
    }

    println(resultArray)

This code was run on Xcode 6.4 and the result is correct. Sorry, I need at least 10 reputation to post images, so I can't post a screenshot of the result here.


5 Comments

Note that this will crash if the input string contains multiple non-ASCII characters like "ÄÖÜ" or "€". The reason is that counting UTF-8 bytes is not the right method to compute an NSRange. Compare e.g. stackoverflow.com/questions/27880650/….
Thanks Zhiguo, I have updated the question to handle one last test case.
I can't seem to get it to work if I set seperateComponent to: let seperateComponent = "<a href=\"somesite.html?id=" It includes the other link values.
Thanks Martin R, that's a really serious situation that I had never thought about.
And @TimNuwin, I understand your problem now and I'm glad to help. 1. \W matches any non-word character, i.e. anything other than letters, digits, and the underscore, so [\w\W] matches every character, including quotes and angle brackets. 2. If you let seperateComponent = "<a href=\"somesite.html?id=" it won't work, because the last two links end up together in the same string. 3. Sorry, I realized that I misspelled "separate" :D 4. Improved code will be shown below.
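On the NSRange point raised in the comments above: NSRange is measured in UTF-16 code units, which is what NSString's length reports, while the UTF-8 byte count can be larger for non-ASCII text. A small sketch in current Swift (the sample string is made up for illustration):

```swift
import Foundation

// "Ä€ somesite", written with escapes so the code points are unambiguous.
let text = "\u{00C4}\u{20AC} somesite"

let utf8Length = text.utf8.count             // bytes: 2 + 3 + 9
let utf16Length = (text as NSString).length  // UTF-16 units, what NSRange expects

// Building the range from the String avoids computing a length by hand.
let safeRange = NSRange(text.startIndex..<text.endIndex, in: text)

// NSMakeRange(0, utf8Length) would extend past the end of the string here,
// which is the out-of-bounds situation the comment warns about.
print(utf8Length, utf16Length, safeRange.length)
```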
0

I think regular expressions tend to break down when parsing HTML. There are better ways of parsing HTML on iOS. Here is a tutorial on this. TFHpple and NDHpple are your friends here.

Here is a related SO thread.


0

Here's the improved code:

    let htmlString = "<a href=\"somesite.html?id=123\">...</a>" +
        "<a href=\"somesite.html?id=456\">...</a>" +
        "<a href=\"somesite.html?id=789\">...</a>" +
        "<a href=\"anothersite.html\">...</a>\""
    let seperateComponent = "<a href=\""

    // After splitting on `<a href="`, each piece of this input holds a single `">`,
    // so the greedy pattern ends right after the href value.
    let linkExp = "[\\w\\W]*\">"
    let linkRegExp = NSRegularExpression(pattern: linkExp, options: NSRegularExpressionOptions.CaseInsensitive, error: nil)
    let seperatedArray = htmlString.componentsSeparatedByString(seperateComponent)
    var resultArray = [String]()

    // Only links that start with this prefix are kept.
    let linkWished = "somesite.html?id="

    if seperatedArray.count > 1 {
        for seperatedString in seperatedArray {
            // NSRange counts UTF-16 units, so take lengths from NSString, not from UTF-8 bytes.
            let nsString = seperatedString as NSString
            if nsString.length > 3 {
                let myRange = linkRegExp!.rangeOfFirstMatchInString(seperatedString, options: NSMatchingOptions.ReportCompletion, range: NSMakeRange(0, nsString.length))
                if myRange.location != NSNotFound {
                    let matchString = nsString.substringWithRange(myRange)

                    if matchString.hasPrefix(linkWished) {
                        // Drop the prefix, then the trailing `">`.
                        var linkString = (matchString as NSString).substringFromIndex((linkWished as NSString).length)
                        linkString = (linkString as NSString).substringToIndex((linkString as NSString).length - 2)

                        resultArray.append(linkString)
                    }
                }
            }
        }
    }

    println(resultArray)

