
I am trying to run a regular expression search on a string that holds the HTML content of a web search. The pattern I am trying to match has the format HQ 12345; the second fragment can also start with a letter, so HQ A12345 is also possible. As shown in the code below, the regex pattern I am using is "HQ .*[0-9]".

The problem is that when I run the regex search, the match is not just HQ 959693 but also includes the rest of the HTML file content, as shown in the snapshot of the message box below. [Image: RegEx Pattern Matched]

Sub Test()
   Dim mystring As String
   mystring = getHTMLData("loratadine")
   Dim rx As New RegExp
   rx.IgnoreCase = True
   rx.MultiLine = False
   rx.Global = True
   rx.Pattern = "HQ .*[0-9]"
   Dim mtch As Variant
   For Each mtch In rx.Execute(mystring)
      Debug.Print mtch
      MsgBox(mtch)
   Next
End Sub

Public Function getHTMLData (ByVal name As String) As String
   Dim XMLhttp: Set XMLhttp = CreateObject("MSXML2.ServerXMLHTTP")
   XMLhttp.setTimeouts 2000, 2000, 2000, 2000
   XMLhttp.Open "GET", "http://rulings.cbp.gov/results.asp?qu=" & name & "&p=1", False
   XMLhttp.send

   If XMLhttp.Status = 200 Then
      getHTMLData = XMLhttp.responsetext
   Else
      getHTMLData = ""
   End If
End Function
  • I'm not familiar with VBA, but had the same problem in C++. I think the string in the message box is correct because your regex engine returns any string which contains(!) your regular expression. In C++ I had to tell the engine to return a string only if it exactly matches(!) the regular expression. In C++ most engines provide a property or function "exactMatch()" to do this. Maybe your VBA engine provides a similar functionality? Commented Feb 20, 2014 at 15:21

2 Answers

Use ? to make the quantifier non-greedy; otherwise .* will consume everything up to the last digit it can reach. Also, [0-9] on its own matches only a single digit, so add a + to specify "one or more" and capture the whole number:

HQ .*?[0-9]+

Alternatively, you can try to use a negated character class like so:

HQ [^0-9]*[0-9]+

Or you can even simplify it further:

HQ [^\d]*\d+
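
To see the difference between the greedy and non-greedy forms, here is a minimal sketch in VBA. The sample string and the procedure name are made up for illustration, and the real page content will differ. It uses late binding through CreateObject("VBScript.RegExp"), so no library reference needs to be set.

```vba
Sub CompareGreedyAndLazy()
    ' Hypothetical sample text; not taken from the real results page.
    Dim sample As String
    sample = "<td>HQ 959693</td><td>HQ A12345</td><p>Page 1 of 3</p>"

    ' Late-bound regex object (same engine as the RegExp class in the question).
    Dim rx As Object
    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = True
    rx.IgnoreCase = True

    Dim m As Object

    ' Greedy: .* runs to the end of the line, then backs up to the last digit.
    rx.Pattern = "HQ .*[0-9]"
    For Each m In rx.Execute(sample)
        Debug.Print "greedy: " & m.Value
    Next m

    ' Lazy: .*? stops at the first digit, and [0-9]+ takes the rest of the number.
    rx.Pattern = "HQ .*?[0-9]+"
    For Each m In rx.Execute(sample)
        Debug.Print "lazy:   " & m.Value
    Next m
End Sub
```

On this sample the greedy pattern should produce a single match that runs from the first "HQ " through the "3" in "Page 1 of 3", while the lazy pattern should produce two matches, "HQ 959693" and "HQ A12345".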

Regex matching is greedy by default. Unfortunately I could not reproduce your issue exactly, but I am fairly sure it happens because you have a long string in which '.*' matches everything up to a number near the end.

I find this link useful; see the explanation near the bottom about the greediness of *

http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm

I suggest changing your Regex to:

HQ .*?[0-9]+

That will match "HQ ", then as few characters as possible, followed by one or more digits. The "?" after ".*" is what makes it consume the minimal amount.

Please respond if this does not work and I will try getting your regex running in Excel.
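
If the identifier always has the shape described in the question (HQ, a space, an optional single letter, then digits), a stricter pattern avoids the wildcard altogether. The following is a sketch under that assumption; the function name ExtractRulingNumbers and the pattern "HQ [A-Z]?[0-9]+" are illustrative and do not come from the original post.

```vba
' Hypothetical helper: returns every match of the form "HQ 12345" or "HQ A12345".
Function ExtractRulingNumbers(ByVal html As String) As Collection
    Dim rx As Object
    Set rx = CreateObject("VBScript.RegExp")
    rx.Global = True          ' find all matches, not just the first
    rx.IgnoreCase = True      ' mirrors the setting used in the question
    rx.Pattern = "HQ [A-Z]?[0-9]+"

    Dim results As New Collection
    Dim m As Object
    For Each m In rx.Execute(html)
        results.Add m.Value
    Next m

    Set ExtractRulingNumbers = results
End Function
```

It could be called as ExtractRulingNumbers(getHTMLData("loratadine")), looping over the returned Collection to print each value. Because the pattern has no ".*", a stray "HQ " followed by unrelated text cannot stretch the match across the page.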

1 Comment

Thanks for your response. You are absolutely right; the issue is the string being long.
