1

I have a list of messy titles (let's say 1000 of them). I want to analyze these titles for "keywords" that match a small number of genres that I have created (the titles aren't a model, but the genres are).

For example, say the first title string is "awesome playlist of house, EDM and ambient"

Now, say I also have 15 Genres, each with a name attribute.

My end goal is to assign genres to that title string. This is easy enough with some string normalization and .include?

But that doesn't help if there are synonyms. For example, one of my genres has @genre.name chill, which SHOULD apply to ambient in the string above. Likewise, my @genre.name for dance music is dance, and it should match EDM in the string above (EDM = electronic dance music).

So what I'd love to do is for each genre add in 10 or so synonyms so it can check for those as well.

The problem is I'm not sure how to go about doing this in the loop. I guess a loop inside a loop?

This is my code for a 'single level', without synonyms:

  def determine_genres(title)
    relevant_genres = []
    @genres.each do |genre|
      if normalize_string(title).include? normalize_string(genre.name)
        relevant_genres << genre.id
      end
    end
    relevant_genres
  end
1
  • Too verbose. Make it shorter. Just say what is the input, and what output you want. Commented Oct 19, 2012 at 23:46

3 Answers

1

You're definitely on the right track when you say array of array of strings. I'd structure it more like:

genres = {
  'chill' => ['ambient', 'mood', 'chill'],
  'dance' => ['edm', 'trance', 'house']
}

and so on. Each key in the hash is a @genre.name, and the corresponding array lists all of the possible synonyms / subgenres for that genre.

Ruby has a nifty array method, &, that lets you "intersect" two arrays and find the common values. Like so:

[1,2,3,4,5] & [0,3,5,6,8]  # => [3, 5]

See more here: http://www.ruby-doc.org/core-1.9.3/Array.html#method-i-26

If you intersect the words of the normalized title with the array of key terms for a genre, a non-empty result means at least one key term matched, so that genre is relevant.

So you would edit the loop as such (using the genres hash of arrays above):

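def determine_genres(title)
  # normalize_string is assumed to return a String, so split it into words
  # before intersecting; String itself has no & method.
  words = normalize_string(title).split
  relevant_genres = []
  # genres is the hash of arrays above; it must be reachable here,
  # e.g. as a constant or a method.
  genres.each do |genre, terms|
    relevant_genres << Genre.find_by_name(genre).id if (words & terms).any?
  end
  relevant_genres
end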
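
For anyone who wants to try the intersection idea outside Rails, here is a minimal self-contained sketch. The `Genre` Struct, the `GENRES` and `SYNONYMS` constants, and this particular `normalize_string` are stand-ins invented for illustration; the real app would use its own model and normalization.

```ruby
# Hypothetical stand-ins for the Rails model and its records.
Genre = Struct.new(:id, :name)
GENRES = [Genre.new(1, 'chill'), Genre.new(2, 'dance')]

# Synonyms keyed by genre name (same shape as the hash above).
SYNONYMS = {
  'chill' => %w[ambient mood chill],
  'dance' => %w[edm trance house dance]
}

# Assumed normalization: lowercase and turn punctuation into spaces.
def normalize_string(str)
  str.downcase.gsub(/[^a-z0-9\s]/, ' ')
end

def determine_genres(title)
  words = normalize_string(title).split
  GENRES.select do |genre|
    # Fall back to the genre's own name when no synonyms are listed.
    terms = SYNONYMS.fetch(genre.name, [genre.name])
    (words & terms).any?
  end.map(&:id)
end

determine_genres('awesome playlist of house, EDM and ambient') # => [1, 2]
```

Because this compares whole words, multi-word synonyms such as "drum and bass" would not match; those would need a substring check instead.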

You could also have a field in the DB for the Genre model that stores the hash / array of synonymous terms.


1 Comment

If you need any clarifications let me know.
0

What do you think about this approach? Pick one canonical name per genre (like ambient) and map every synonym to it in a hash, i.e.:

hsh = {"chill" => "ambient",
 "chillout" => "ambient",
 "chilloff" => "ambient",
 "ambient" => "ambient",
 "trance"  => "electronic"
}

#then you just need to check the Hash like this:

puts hsh['chill']  #=> ambient
puts hsh['chillout'] #=> ambient
puts hsh['trance'] #=> electronic

The downside is that you need to write down all these synonyms.
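
To apply this lookup to a whole title, one option is to split the title into words and look each one up. A rough sketch, where `SYNONYM_TO_GENRE` and `genre_names_for` are names made up for this example and the word splitting is an assumption about how titles should be tokenized:

```ruby
# Same shape as hsh above: synonym => canonical genre name.
SYNONYM_TO_GENRE = {
  'chill'    => 'ambient',
  'chillout' => 'ambient',
  'chilloff' => 'ambient',
  'ambient'  => 'ambient',
  'trance'   => 'electronic'
}

def genre_names_for(title)
  # Assumed tokenization: lowercase runs of letters and digits.
  words = title.downcase.scan(/[a-z0-9]+/)
  # Unknown words map to nil and are dropped; duplicates are collapsed.
  words.map { |word| SYNONYM_TO_GENRE[word] }.compact.uniq
end

genre_names_for('Chillout mix, then some trance and more chill')
# => ["ambient", "electronic"]
```

Each word costs one hash lookup, so this avoids looping over every genre for every title.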


0

For each synonym, create an instance of Genre with its name being the synonym and the id being the same one as the representative one.

I am not sure if your structure is the most effective, but using it, you can still refactor it like this:

def determine_genres(title)
  title = normalize_string(title)
  @genres.select{|genre| title.include? normalize_string(genre.name)}.map(&:id)
end
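
A runnable sketch of the synonym-as-Genre idea, using a plain Struct in place of the model. The `Genre` Struct, the sample records, and this `normalize_string` are assumptions for illustration. Note that in ActiveRecord the id is a unique primary key, so a real model would probably need a separate column (for example a hypothetical `canonical_id`) to point synonyms at their representative genre.

```ruby
# Stand-in for the model; several entries deliberately share an id.
Genre = Struct.new(:id, :name)

genres = [
  Genre.new(1, 'chill'), Genre.new(1, 'ambient'),
  Genre.new(2, 'dance'), Genre.new(2, 'edm'), Genre.new(2, 'house')
]

# Assumed normalization: lowercase and trim surrounding whitespace.
def normalize_string(str)
  str.downcase.strip
end

def determine_genres(title, genres)
  title = normalize_string(title)
  genres
    .select { |genre| title.include?(normalize_string(genre.name)) }
    .map(&:id)
    .uniq # several synonyms of one genre may match the same title
end

determine_genres('awesome playlist of house, EDM and ambient', genres)
# => [1, 2]
```

Since `include?` is a substring test, a name like house would also match inside warehouse; splitting the title into words first avoids that if it matters.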
