Identifying duplicates in specific CSV output

Question

Ruby newbie here. I've got a product csv where first col is a unique SKU and second col is a product ID that can be duplicated across multiple products (+ many other cols but these are the pertinent ones). Like:

SKU     | Prod ID
 99     | 10384
100     | 10385
101     | 10385
102     | 10386
103     | 10386
104     | 10387

In the script I'm writing, the first time a product ID is used will become a 'parent', and any subsequent instances of the product ID get treated differently (ie, different sizes).

Currently am reading in the whole CSV rather than doing foreach line as I assumed I'd need all the data available to find the duplicates.

The issue is that I'm not sure how to identify the first time a product ID is used, and then identify any further instances of its use.

My first thought was to somehow identify the duplicates (uniq?) and then create a new column and put a 1 if it's the first time it's occurred and 0 if it's occurred previously. After looking at uniq I'm not sure how I then go back to the main list and mark my 1's and 0's.

Can someone please point me in the direction of the classes/methods I need to be looking at?

Thanks, Liam

Edit for John D: this gives me a hash, but in 1:1 format (one SKU to one product ID), not one product ID to all of its instances:

CSV.foreach(INPUT, :headers => true, :header_converters => :symbol, :col_sep => "|", :quote_char => "\x00") do |csv_obj|
  items[csv_obj.fields[0]] = [csv_obj.fields[1]]
end

which gives: "230709"=>["88507"], "109064"=>["9019"]

John Dibling · Accepted Answer · 2014-03-26 12:08:46Z

You're thinking of the SKU as the unique identifier, which it may in fact be. But if you turn that on its head and think of the ProductID as the unique identifier, then you can build a Hash where the key is the ProductID and the value is an Array of SKUs. Then you'll be able to track which SKUs are associated with which ProductID.

Of course you'll read this in some other way, but the end result would be similar to:

products = {
  10384 => [99],
  10385 => [100, 101],
  10386 => [102, 103],
  10387 => [104]
}

Here's an example of how to construct this Hash:

#!/usr/bin/env ruby
require 'csv'

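# Sample data: one "SKU|ProductID" pair per line
source = [
  "99|10384",
  "100|10385",
  "101|10385",
  "102|10386",
  "103|10386",
  "104|10387"
].join("\n")

# CSV.parse returns an Array of rows; every field is a String
rows = CSV.parse(source, :col_sep => "|")

# Build a Hash of ProductID => Array of SKUs, in file order
hh = rows.inject({}) do |memo, row|
  sku = row[0]
  prod = row[1]

  memo[prod] = [] unless memo.include?(prod)
  memo[prod] << sku
  memo  # the block must return the accumulator for the next iteration
end

puts hh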
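
Once the Hash exists, the first SKU in each Array is the first one that appeared in the file, so it can act as the 'parent' and the rest as its variants. A minimal sketch of that step, assuming the rows were read in file order; the names products, labelled, parent and variants are illustrative only and not from the asker's script:

```ruby
# Hash of product ID => SKUs, in the order the rows were read
# (sample data mirroring the table in the question).
products = {
  "10384" => ["99"],
  "10385" => ["100", "101"],
  "10386" => ["102", "103"],
  "10387" => ["104"]
}

# Treat the first SKU per product ID as the parent, the rest as variants.
labelled = products.map do |prod, skus|
  parent, *variants = skus
  { :prod => prod, :parent => parent, :variants => variants }
end

labelled.each do |entry|
  puts "#{entry[:prod]}: parent=#{entry[:parent]} variants=#{entry[:variants].inspect}"
end
```

A product ID that occurs only once simply ends up with an empty variants Array.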

edited Mar 26, 2014 at 12:08

answered Mar 26, 2014 at 1:25

John Dibling

2 Comments

gorlaz Over a year ago

Thanks John, I'm able to create the hash/array structure but I can't get my head around how to merge the duplicates?

gorlaz Over a year ago

Thanks John, the dbl parameter block on the do was new to me, as was inject and include?

John C · Accepted Answer · 2014-03-26 11:40:31Z

.group_by() is relatively new (though it has an older counterpart in Rails), but is awfully convenient and should do most of your heavy lifting.

If you create a class to hold each row and put them in an Array, then you can call the group_by method with a block that just checks each object's Product ID field.

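That gives you a Hash keyed by Product ID, where each value is an Array of the matching objects. You can iterate through it with .each (or .keys.each if you only need the Product IDs).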

Assuming a whole bunch of things about your program that are hopefully semi-obvious, something like:

transactionHash = transactions.group_by { |x| x.productId }

Then, you can go through your transaction lists per product with:

transactionHash.each do |prodId,transList|
  # transList has all of your transaction objects per product
end
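
For a concrete version of the above, here is a minimal sketch that uses a Struct as the row class. Transaction, sku and the sample values are stand-ins assumed for illustration; a real class would carry the other columns too:

```ruby
# Hypothetical row class; only the two columns from the question are modelled.
Transaction = Struct.new(:sku, :productId)

transactions = [
  Transaction.new("99",  "10384"),
  Transaction.new("100", "10385"),
  Transaction.new("101", "10385"),
  Transaction.new("102", "10386")
]

# group_by returns a Hash: product ID => Array of Transaction objects,
# with each Array kept in the original row order.
transactionHash = transactions.group_by { |x| x.productId }

transactionHash.each do |prodId, transList|
  puts "#{prodId}: #{transList.map(&:sku).join(', ')}"
end
```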

Again, that assumes you're keeping your transactions in a list of objects. The x.productId would be something like x[1] if you store each transaction in an array, for example.
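
If the rows are plain arrays straight from the CSV library, group_by works the same way with an index in the block. A sketch under the assumption that the input is pipe-separated with the SKU in column 0 and the product ID in column 1; the appended flag is the 1/0 marker described in the question, and the variable names are made up for the example:

```ruby
require 'csv'

data = "99|10384\n100|10385\n101|10385\n102|10386"
rows = CSV.parse(data, :col_sep => "|")   # Array of Arrays of Strings

# Group the row arrays by their second field (the product ID).
grouped = rows.group_by { |row| row[1] }

# Append 1 to the first row of each group and 0 to every later row.
flagged = grouped.values.flat_map do |group|
  group.each_with_index.map { |row, i| row + [i.zero? ? 1 : 0] }
end

flagged.each { |row| puts row.join("|") }
```

Note that this emits rows group by group, which matches the file order only when duplicates of a product ID sit next to each other, as they do in the question's sample.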

edited Mar 26, 2014 at 11:40

answered Mar 26, 2014 at 1:26

John C

3 Comments

gorlaz Over a year ago

Thanks John C; Just trying to work out how the terminology fits. I've got an array of arrays from reading in the csv, but can't get my head around how I call group by on the detailed row array inside the array holding the csv's data.

John C Over a year ago

I avoided specifics, since I didn't want to assume the wrong structure for your data and mislead you, but I've added what I think is generic enough to use. The key is that group_by just takes an expression that explains what defines the groupings.

gorlaz Over a year ago

Thanks John C; this helped me a lot as well; i.e. I refactored my existing code to use a class, feed it in, then use group_by, where previously I'd just been dealing with raw arrays stuffed into variables.
