I've imported a .csv file into an array. The fields are comma-separated, so I split each line and built an array of hashes from them.

Now I'm trying to find records with matching IDs so I can remove duplicates and keep only the last one encountered.

The import works, but for some reason I can't get a method like uniq to display the new unique list, even though calling .length on the result returns the right number of rows.

Any help would be greatly appreciated.

CODE

    lines = []
    i = 0

    file = File.open("./properties.csv", "r")

    elements = Array[]
    element2 = Array[]
    output = Array[]

    while (line = file.gets)
      i += 1
      # use split to break the line up on commas
      arr = line.split(',')
      elements.push({ id: arr[0], streetAddress: arr[1], town: arr[2], valuationDate: arr[3], value: arr[4] })
    end

    file.close

    # Group records by id and keep only the groups that have more than one record
    element2 = elements.group_by { |c| c[:id] }.values.select { |group| group.size > 1 }


    output = (element2.uniq)
    puts output

    puts element2.length

SAMPLE .CSV FILE

ID,Street address,Town,Valuation date,Value
1,1 Northburn RD,WANAKA,1/1/2015,280000
2,1 Mount Ida PL,WANAKA,1/1/2015,280000
3,1 Mount Linton AVE,WANAKA,1/1/2015,780000
1,1 Northburn RD,WANAKA,1/1/2015,330000
2,1 Mount Ida PL,WANAKA,1/1/2015,330000
3,1 Mount Linton AVE,WANAKA,1/1/2015,830000
1,1 Northburn RD,WANAKA,1/1/2016,340000
2,1 Mount Ida PL,WANAKA,1/1/2016,340000
3,1 Mount Linton AVE,WANAKA,1/1/2016,840000
4,1 Kamahi ST,WANAKA,1/1/2016,215000
5,1 Kapuka LANE,WANAKA,1/1/2016,209000
6,1 Mohua MEWS,WANAKA,1/1/2016,620000
7,1 Kakapo CT,WANAKA,1/1/2016,490000
8,1 Mt Gold PL,WANAKA,1/1/2016,1320000
9,1 Penrith Park DR,WANAKA,1/1/2016,1310000
  • You probably mean [] instead of Array[]. Commented May 12, 2016 at 4:05
  • That's true! But correct me if I'm wrong: isn't that a distinction without a difference? Commented May 12, 2016 at 4:09
  • The difference is using Array[] is just plain bizarre. Using the simplest expression is generally the best. Commented May 12, 2016 at 4:12
  • haha thanks for that! Commented May 12, 2016 at 4:16

1 Answer

I've actually switched my approach to using a hash, which seems to automatically remove duplicates and leave the last encountered record intact. Can anyone shed some light on why?

    require 'csv'

    element = {}

    CSV.foreach("properties.csv", :headers => true, :header_converters => :symbol) do |row|
        element[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
    end

    puts element["1"]

    element.each do |key, value|
        puts key 
        puts value
    end

    puts "#{element.length} records returned" 
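As a minimal sketch of why the last record survives: assigning to a key that already exists in a Hash replaces the stored value. The example below uses a few rows from the question's sample data inline (via CSV.parse, so no file is needed); the variable names `sample` and `last_seen` are made up for illustration.

```ruby
require 'csv'

# A few rows from the question's sample CSV, inlined so the sketch runs without a file.
sample = <<~CSV_DATA
  ID,Street address,Town,Valuation date,Value
  1,1 Northburn RD,WANAKA,1/1/2015,280000
  2,1 Mount Ida PL,WANAKA,1/1/2015,280000
  1,1 Northburn RD,WANAKA,1/1/2015,330000
  1,1 Northburn RD,WANAKA,1/1/2016,340000
CSV_DATA

last_seen = {}

CSV.parse(sample, headers: true, header_converters: :symbol) do |row|
  # Assigning to an existing key replaces the old value, so later rows win.
  last_seen[row[:id]] = row.to_h
end

last_seen.each { |id, record| puts "#{id}: #{record[:value]}" }
```

Here id "1" appears three times, and only the values from its final row are left in the hash.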

To keep the first matching element, instead of the last, you can do a key existence check before assigning the value. This can be done like so:

    CSV.foreach("properties.csv", :headers => true, :header_converters => :symbol) do |row|
      key = row.fields[0]
      unless element.key?(key)
        element[key] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
      end
    end

which can also be written more concisely like this:

    CSV.foreach("properties.csv", :headers => true, :header_converters => :symbol) do |row|
      element[row.fields[0]] ||= Hash[row.headers[1..-1].zip(row.fields[1..-1])]
    end

Note that preserving the first record found for a key can also do less work than preserving the last one, especially when the file has many duplicates. Once a key is already present, the value is never built, so the slice and zip calls in this code are skipped for every later duplicate.
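To illustrate the work avoidance, here is a small sketch (not a benchmark) that counts how often the value is actually built when `||=` is used. The `builds` counter, the `build_record` lambda, and the `rows` array are hypothetical stand-ins for the slice-and-zip expression and the CSV rows.

```ruby
# Hypothetical counter and helper, introduced only for this illustration.
builds = 0
build_record = lambda do |value|
  builds += 1 # counts how many times a value is actually built
  { value: value }
end

# Stand-in for parsed CSV rows: [id, value] pairs with duplicate ids.
rows = [["1", "280000"], ["2", "280000"], ["1", "330000"], ["1", "340000"]]

first_seen = {}
rows.each do |id, value|
  # The right-hand side runs only when first_seen[id] is nil or false.
  first_seen[id] ||= build_record.call(value)
end

puts first_seen["1"][:value] # the first record seen for id "1"
puts builds                  # how many times the lambda ran
```

With four rows and two distinct ids, the lambda runs only twice. Plain assignment (`=`) would have built a value for every row.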

7 Comments

Anyone know how I could reverse this so the hash only takes the first duplicate entry instead of the last encountered?
Hash uses a unique key as its index. When you use element[row.fields[0]], what you're doing is overwriting the previous value in the hash for that key. This will give you the uniqueness, as long as you're fine with the last id value being the one that gets retained. The new code is generations better than the original, so kudos on coming to that solution! :D
thanks! What if I wanted to retain the first value entered into the hash and then ignore the following?
I updated the answer with some details that show how to do that, and the benefits of preserving the first match instead of the last one. Great follow-up question!
you are an actual legend. How do I give you "boss" points for comments/edits? Thank you so much.