3

I am working with some UK address data which within an Excel cell is split into its constituent parts by a comma.

I have some VBA which I've taken from the web which has removed a number of exact duplicated entries but I am left with a large amount of data which has repeating segments some sequentially and some non sequentially.

Attached is an image highlighting what I am trying to achieve, the code I have used thus far which is not mine is included to show you the direction in which I have been looking. Anyone have any further thoughts on how this can be achieved?

Function stringOfUniques(inputString As String, delimiter As String)
Dim xVal As Variant
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")

For Each xVal In Split(inputString, delimiter)
dict(xVal) = xVal
Next xVal

stringOfUniques = Join(dict.Keys(), ",")
End Function

This did manage to get rid of a number of them but there is a huge population that I am working on so automating this would be incredible.

Ideal Outcome

1
  • A RegExp with back-references would be another possible option Commented Mar 15, 2016 at 4:22

3 Answers 3

4

Possibly not the most elegant answer, but this does the trick. Here I use the Split command to split the string at each comma. The result returned from this is

bat ball banana

Code:

Option Explicit
Private Sub test()
 Dim Mystring As String
 Dim StrResult As String

 Mystring = "bat,ball,bat,ball,banana"
 StrResult = shed_duplicates(Mystring)
End Sub
Private Function shed_duplicates(ByRef Mystring As String) As String
 Dim MySplitz() As String
 Dim J As Integer
 Dim K As Integer
 Dim BooMatch As Boolean
 Dim StrTemp(10) As String ' assumes no more than 10 possible splits!
 Dim StrResult As String


 MySplitz = Split(Mystring, ",")
  For J = 0 To UBound(MySplitz)
     BooMatch = False
     For K = 0 To UBound(StrTemp)
         If MySplitz(J) = StrTemp(K) Then
            BooMatch = True
            Exit For
         End If
     Next K
    If Not BooMatch Then
       StrTemp(J) = MySplitz(J)
    End If
Next
For J = 0 To UBound(StrTemp)
   If Len(StrTemp(J)) > 0 Then ' ignore blank entries
      StrResult = StrResult + StrTemp(J) + " "
   End If
Next J
Debug.Print StrResult
End Function
Sign up to request clarification or add additional context in comments.

Comments

3
+50

You may really use a regex replacement:

^(\d*\s*([^,]*),.*)\2(,|$)

The replacement pattern is

$1$3

See the regex demo. The pattern explanation:

  • ^ - start of a string (or of a line if .MultiLine = True)
  • (\d*\s*([^,]*),.*) - Group 1 (later referenced to with $1 backreference from the replacement pattern) matching:
    • \d* - 0+ digits followed with
    • \s* - 0+ whitespace characters
    • ([^,]*) - Group 2 (later we can use \2 in-pattern backreference to refer to the value captured with this subpattern) matching 0+ characters other than a comma
    • ,.* - a comma followed with 0+ characters other than a newline
  • \2 - the text captured by Group 2
  • (,|$) - Group 3 (later referenced to with $3 from the replacement pattern - to restore the comma) matching either a comma or the end of string (or line if .MultiLine = True).

NOTE: You do not need .MultiLine = True if you just check individual cells with containing one address.

Below is a sample VBA Sub showing how this can be used in VBA:

Sub test()
  Dim regEx As Object
  Set regEx = CreateObject("VBScript.RegExp")
  With regEx
      .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)"
      .Global = True
      .MultiLine = True ' Remove if individual addresses are matched
  End With
  s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _
      "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _
      "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _
      "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD"
  MsgBox regEx.Replace(s, "$1$3")
End Sub

enter image description here

2 Comments

Great job! Would note that a small tweak to remove doubled up commas post the regex replace would be worthwhile.
I had never ever seen the regex function, exemplary solution!
1

A first solution would be to use a dictionary to get a list of unique segments. It would then be as simple as skipping the first address number before splitting the segments:

Function RemoveDuplicates1(text As String) As String
  Static dict As Object
  If dict Is Nothing Then
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = 1  ' set the case sensitivity to All
  Else
    dict.RemoveAll
  End If

  ' Get the position just after the address number
  Dim c&, istart&, segment
  For istart = 1 To Len(text)
    c = Asc(Mid$(text, istart, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Split the segments and add each one of them to the dictionary. No need to keep 
  ' a reference to each segment since the keys are returned by order of insertion.
  For Each segment In Split(Mid$(text, istart), ",")
    If Len(segment) Then dict(segment) = Empty
  Next

  ' Return the address number and the segments by joining the keys
  RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",")
End Function

A second solution would be to extract all the segments and then search if each one of them is present at a previous position:

Function RemoveDuplicates2(text As String) As String
  Dim c&, segments$, segment$, length&, ifirst&, istart&, iend&

  ' Get the position just after the address number
  For ifirst = 1 To Len(text)
    c = Asc(Mid$(text, ifirst, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Get the segments without the address number and add a leading/trailing comma
  segments = "," & Mid$(text, ifirst) & ","
  istart = 1

  ' iterate each segment
  Do While istart < Len(segments)

    ' Get the next segment position
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF
    If iend - istart Then

      ' Get the segment
      segment = Mid$(segments, istart, iend - istart + 2)

      ' Rewrite the segment if not present at a previous position
      If InStr(1, segments, segment, vbTextCompare) = istart Then
        Mid$(segments, length + 1) = segment
        length = length + Len(segment) - 1
      End If
    End If

    istart = iend + 1
  Loop

  ' Return the address number and the segments
  RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1)

End Function

And a third solution would be to use a regular expression to remove all the duplicated segments:

Function RemoveDuplicates3(ByVal text As String) As String

  Static re As Object
  If re Is Nothing Then
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Match any duplicated segment separated by a comma.
    ' The first segment is compared without the first digits.
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)"
  End If

  ' Remove each matching segment
  Do While re.test(text)
    text = re.Replace(text, "$1")
  Loop

  RemoveDuplicates3 = text
End Function

These are the execution times for 10000 iterations (the lower the better):

input text  : "123 abc,,1 abc,abc 2,ABC,abc,a,c"
output text : "123 abc,1 abc,abc 2,a,c"

RemoveDuplicates1 (dictionary)  : 718 ms
RemoveDuplicates2 (text search) : 219 ms
RemoveDuplicates3 (regex)       : 1469 ms

1 Comment

Once again thank you for the multitude of elegant solutions each seems to do what I require. Appreciated!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.