1

Within an Excel column I have data such as:

"Audi (ADI), Mercedes (modelx) (MEX), Ferrari super fast, high PS (FEH)"

There are hundreds of models, each described by a name followed by an abbreviation of three capital letters in parentheses.

I need to extract the names and the abbreviations into separate cells. I managed to do this for the abbreviations with the following function:

Function extrABR(cellRef) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As Variant
    Const sPat As String = "([A-Z][A-Z][A-Z][A-Z]?)"  ' this is my regex to match my string

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = sPat
        If .Test(cellRef) Then
            Set MC = .Execute(cellRef)
            For Each M In MC
                sTemp = sTemp & ", " & M.SubMatches(0)
            Next M
        End If
    End With

    extrABR = Mid(sTemp, 3)
End Function

However, I have not managed to do the same for the names. I thought of simply replacing the regex with the following one: (^(.*?)(?= \([A-Z][A-Z][A-Z])|(?<=, )(.*)(?= \([A-Z][A-Z][A-Z])), but VBA does not seem to allow lookbehind.

Any idea?

2
  • You will get a collection of abbreviations with "\([^)]+\)". In a second pass, replace all of those matches with "" to get a string without abbreviations, which you can then split into names. Commented Jul 12, 2021 at 15:07
  • Do you mean you need to obtain an array of Audi, Mercedes (modelx), and Ferrari super fast, high PS? Commented Jul 12, 2021 at 15:21

2 Answers 2

1

Correct, lookbehinds are not supported, but they are only necessary when your expected matches overlap. That is not the case here: all your matches are non-overlapping. So, you can again rely on capturing:

(?:^|,)\s*(.*?)(?=\s*\([A-Z]{3,}\))

See the regex demo. Group 1 values are accessed via .Submatches(0).

Details:

  • (?:^|,) - either start of a string or a comma
  • \s* - zero or more whitespace chars
  • (.*?) - Capturing group 1: any zero or more chars other than line break chars as few as possible
  • (?=\s*\([A-Z]{3,}\)) - a positive lookahead that matches a location that is immediately followed by
    • \s* - zero or more whitespace chars
    • \( - a ( char
    • [A-Z]{3,} - three or more uppercase chars
    • \) - a ) char.
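
As a minimal sketch of how this pattern could be plugged into the asker's function, the version below collects Group 1 of every match. The name extrNames and the "; " separator are assumptions made for illustration (names can themselves contain commas, so a comma separator would be ambiguous); the rest mirrors extrABR from the question.

```vba
' Sketch only: extrNames is a hypothetical name modelled on extrABR from the question.
Function extrNames(cellRef) As String
    Dim RE As Object, MC As Object, M As Object
    Dim sTemp As String
    ' Group 1 captures the name; the lookahead stops just before " (ABC)".
    Const sPat As String = "(?:^|,)\s*(.*?)(?=\s*\([A-Z]{3,}\))"

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True          ' find every name, not just the first
        .Pattern = sPat
        If .Test(cellRef) Then
            Set MC = .Execute(cellRef)
            For Each M In MC
                ' "; " is an assumed separator, since names may contain commas.
                sTemp = sTemp & "; " & M.SubMatches(0)
            Next M
        End If
    End With

    ' Drop the leading "; " added before the first name.
    extrNames = Mid(sTemp, 3)
End Function
```

With the sample string from the question, =extrNames(A1) should return Audi; Mercedes (modelx); Ferrari super fast, high PS, because the lazy (.*?) keeps expanding past "(modelx)" and past the comma in the Ferrari name until an uppercase abbreviation follows.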


3 Comments

Great to learn this. Maybe the OP expects at most 4 capital letters between the parentheses. So can we replace ([A-Z]{3,} with ([A-Z]{4,} so that it matches 3 or 4 capital letters?
@Naresh To match three or four, [A-Z]{3,4} should be used.
Got it.. thank you. I was just having a look at this learn.microsoft.com page
0

Use RE.Replace: try this function. Anything between parentheses is replaced with "", giving you a string of model names only, which you can then split on commas to get a string array if desired.

Function ModelNames(cellRef) As String
    Dim RE As Object
    Dim sPat As String
    ' Matches any parenthesised group, so "(modelx)" is removed as well.
    sPat = "\([^)]+\)"
    ' To keep (modelx) in the final output, match only the abbreviations instead, e.g. "\([A-Z]{3,4}\)".

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = sPat
    End With

    ModelNames = RE.Replace(cellRef, "")
End Function
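
One caveat with splitting the result on commas is that a name such as "Ferrari super fast, high PS" contains a comma itself. The sketch below is one possible workaround, not part of the original answer: it replaces each abbreviation together with the separator that follows it by a marker character and splits on that marker. The function name ModelNameList and the "|" marker are assumptions, and the approach only works if "|" never occurs in the data.

```vba
' Sketch only: ModelNameList is a hypothetical helper, not part of the answer above.
Function ModelNameList(cellRef) As Variant
    Dim RE As Object
    Dim sTemp As String

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        ' An abbreviation in parentheses plus the comma after it (or the end of the string).
        .Pattern = "\s*\([A-Z]{3,4}\)(?:\s*,\s*|\s*$)"
        ' "|" is an assumed marker that must not appear in the names.
        sTemp = .Replace(cellRef, "|")
    End With

    ' Remove the trailing marker left by the last abbreviation, if any.
    If Right(sTemp, 1) = "|" Then sTemp = Left(sTemp, Len(sTemp) - 1)

    ' Zero-based array of names; commas inside a name are preserved.
    ModelNameList = Split(sTemp, "|")
End Function
```

Because only uppercase abbreviations of three or four letters are matched, "(modelx)" stays part of the Mercedes name, and the comma inside the Ferrari name is left alone since no abbreviation precedes it.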
