I have a dataset of about 50,000 rows; each row (or cell) holds values separated by commas.

item 1, item 2, item 1, item 1, item3, item 2, item 4, item3

The desired output is simply:

item 1, item 2, item3, item 4

I can use Excel, OpenOffice Calc, Notepad++, or any other freely available program. (I found a JavaScript solution, but it was for a single string; attempting to run it 50,000 times either did not work or would take far longer than I have, and I don't know enough JS to adjust it.)

Any suggestions on how to do this?

(Edited to note that some items will contain spaces.)

  • They would be, but that will not happen; I have edited the original question to reflect that (when I added spaces, I missed item2 once). It is perfectly OK for the script to assume they are separate values, as that example will not exist in the dataset. (This is in regard to a deleted comment about whether item1 and item 1 are duplicates.) Commented Jun 27, 2012 at 18:55
  • Forgot to note: additional bonus super points to any output that puts them in alphabetical order. Commented Jun 27, 2012 at 19:01
  • I have a VBA script for this, but it is on my work computer. I'll post it in 17 hours if no one else answers sooner. Commented Jun 27, 2012 at 19:07
  • Sure thing; that would be much appreciated! I have tried a few things, but they all lock up when dealing with a dataset this large, as many strings have 50-100 values and there are about 50k rows. Commented Jun 27, 2012 at 19:10

1 Answer

This should get you started. For better performance, turn off Application.ScreenUpdating and automatic calculation while it runs.

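Sub Tester()

    Dim dict As Object
    Dim arrItems, c As Range, y As Long
    Dim item As String

    ' Late-bound Scripting.Dictionary, so no library reference is required
    Set dict = CreateObject("scripting.dictionary")

    ' Adjust the range to cover your data
    For Each c In ActiveSheet.Range("A1:A100").Cells

        arrItems = Split(c.Value, ",")
        dict.RemoveAll   ' start with an empty dictionary for each row
        For y = LBound(arrItems) To UBound(arrItems)
            item = Trim(arrItems(y))
            ' Keep the first occurrence of each non-empty value
            If Len(item) > 0 Then
                If Not dict.exists(item) Then dict.Add item, 1
            End If
        Next y

        ' Write the sorted, de-duplicated list to the column on the right
        c.Offset(0, 1).Value = Join(ArraySort(dict.keys), ",")

    Next c

End Sub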

For sorting the keys:

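Function ArraySort(MyArray As Variant) As Variant

    ' Simple exchange sort; adequate for the short lists in each row.
    ' String comparison is case-sensitive unless Option Compare Text is set.
    Dim First           As Long
    Dim Last            As Long
    Dim i               As Long
    Dim j               As Long
    Dim Temp

    First = LBound(MyArray)
    Last = UBound(MyArray)
    For i = First To Last - 1
        For j = i + 1 To Last
            ' Swap whenever an earlier element sorts after a later one
            If MyArray(i) > MyArray(j) Then
                Temp = MyArray(j)
                MyArray(j) = MyArray(i)
                MyArray(i) = Temp
            End If
        Next j
    Next i
    ArraySort = MyArray

End Function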
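
The advice at the top of this answer about screen updating and calculation could look roughly like the wrapper below. This is only a sketch: the name RunTesterFast is made up for illustration, and it assumes Tester lives in the same workbook and that a workbook is open when it runs.

```vba
Sub RunTesterFast()
    ' Hypothetical wrapper (not part of the original answer): runs Tester
    ' with screen redraws and automatic recalculation switched off.
    Dim prevCalc As XlCalculation
    Dim prevScreen As Boolean
    Dim errNum As Long
    Dim errMsg As String

    ' Remember the current settings so they can be restored afterwards
    prevCalc = Application.Calculation
    prevScreen = Application.ScreenUpdating

    On Error GoTo CleanUp
    Application.ScreenUpdating = False
    Application.Calculation = xlCalculationManual

    Tester

CleanUp:
    ' Reached on success and on error: capture any error first,
    ' then put Excel back the way it was
    errNum = Err.Number
    errMsg = Err.Description
    Application.Calculation = prevCalc
    Application.ScreenUpdating = prevScreen
    If errNum <> 0 Then MsgBox "Tester failed: " & errMsg
End Sub
```

Restoring the saved values, rather than forcing automatic calculation back on, leaves a workbook that was already in manual mode untouched.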

6 Comments

Worked wonderfully. Would there be a good way for the output to be sorted alphabetically as well? Unfortunately I don't know any VB (I figured out how to modify the range, and how to change "," to ", ", but as for actually coding a new function, I'm clueless).
Running this generates the error "User-defined type not verified" or "User-defined type not defined". A quick Google search says it is about references; what should be referenced to make this work properly (the sorting)?
I just retested the code from my answer and it works fine, so I don't know why you're seeing that error. Is there any other code in that workbook which might be causing the problem? If you can't fix it, I can send you a workbook with the working code.
Let me try running it in a totally separate workbook and migrating my data to that real fast. I will edit this reply shortly to post the results.
ArraySort is called from the other sub; it's not a worksheet function. It's just there to sort the de-duplicated output from Tester.