0

For instance we have following string. "The beatles - Imagine" Plus, we have a huge list of artist names in PostgreSQL.

Given that string I want to identify the artist using my database.

I'm looking for most optimal, fast algorithm/technology to do this. So iterating over all records in database and looking for substring is not applicable.

String can be "Imagine - The beatles", "Imagine, The Beatles". Just like song names in Youtube videos.

Would be Solr, ElasticSearch or other technology be helpful here? Would love some geek advices for this.

2
  • Also, "Beatles, The". Commented Jan 18, 2014 at 4:17
  • 2
    Erm... s/Beatles/John Lennon Commented Jan 18, 2014 at 10:56

2 Answers 2

2

There are two parts to this problem. The hard part is identifying artist and title. You've got all sorts of variations:

  • Beatles, The - Imagine
  • The Beatles - Imagine
  • Imagine - The Beatles
  • The Beatles, Imagine
  • Imagine, The Beatles
  • Imagine - Beatles, The

Others will include album too:

  • Imagine - Imagine - The Beatles

If you have these as a random mismash then you're going to have a hard time dealing with that - normalizing this data into fields is going to require a database of "track names" and "artist names" to attempt to match with, and a lot of guesswork.

What I'd do is ignore the whole mess and throw it at a full text search engine.

test=> select to_tsvector('simple', 'Beatles, The - Imagine');
           to_tsvector           
---------------------------------
 'beatles':1 'imagine':3 'the':2
(1 row)

test=> select to_tsvector('simple', 'Beatles, The - Imagine') @@ to_tsquery('simple', 'Beatles');
 ?column? 
----------
 t
(1 row)

If you were able to turn this into field-separated normalized data, your searches would become much more powerful as you could do weighted matches on fields using setweight, ts_rank, tsvector concatenation with ||, etc.

Sign up to request clarification or add additional context in comments.

Comments

0

In principle, if any record in your database could contain your search string then you are going to have to search every record in your database.

What you can do is to use something like the Rabin-Karp algorithm to search for a lot of same-length versions of your search string simultaneously: "Beatles The", "The Beatles". If you ignore spaces and/or punctuation then you might be able to reduce the number of passes even more: "The Beatles", "Beatles, The", "Beatles The". All the examples in Craig Ringer's answer are the same length if you count only letters; you could find all those matches with a single pass through the database using Rabin-Karp

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.