Algorithm: Extract keyword from string by database

Question

For instance we have following string. "The beatles - Imagine" Plus, we have a huge list of artist names in PostgreSQL.

Given that string I want to identify the artist using my database.

I'm looking for most optimal, fast algorithm/technology to do this. So iterating over all records in database and looking for substring is not applicable.

String can be "Imagine - The beatles", "Imagine, The Beatles". Just like song names in Youtube videos.

Would be Solr, ElasticSearch or other technology be helpful here? Would love some geek advices for this.

Also, "Beatles, The".

Craig Ringer
– Craig Ringer

2014-01-18 04:17:44 +00:00
Commented Jan 18, 2014 at 4:17 — Craig Ringer
– Craig Ringer, Commented Jan 18, 2014 at 4:17
Erm... s/Beatles/John Lennon

rossum
– rossum

2014-01-18 10:56:14 +00:00
Commented Jan 18, 2014 at 10:56 — rossum
– rossum, Commented Jan 18, 2014 at 10:56

Craig Ringer · Accepted Answer · 2014-01-18 04:28:23Z

There are two parts to this problem. The hard part is identifying artist and title. You've got all sorts of variations:

Beatles, The - Imagine
The Beatles - Imagine
Imagine - The Beatles
The Beatles, Imagine
Imagine, The Beatles
Imagine - Beatles, The

Others will include album too:

Imagine - Imagine - The Beatles

If you have these as a random mismash then you're going to have a hard time dealing with that - normalizing this data into fields is going to require a database of "track names" and "artist names" to attempt to match with, and a lot of guesswork.

What I'd do is ignore the whole mess and throw it at a full text search engine.

test=> select to_tsvector('simple', 'Beatles, The - Imagine');
           to_tsvector           
---------------------------------
 'beatles':1 'imagine':3 'the':2
(1 row)

test=> select to_tsvector('simple', 'Beatles, The - Imagine') @@ to_tsquery('simple', 'Beatles');
 ?column? 
----------
 t
(1 row)

If you were able to turn this into field-separated normalized data, your searches would become much more powerful as you could do weighted matches on fields using setweight, ts_rank, tsvector concatenation with ||, etc.

rossum · Accepted Answer · 2014-01-18 11:07:46Z

0

In principle, if any record in your database could contain your search string then you are going to have to search every record in your database.

What you can do is to use something like the Rabin-Karp algorithm to search for a lot of same-length versions of your search string simultaneously: "Beatles The", "The Beatles". If you ignore spaces and/or punctuation then you might be able to reduce the number of passes even more: "The Beatles", "Beatles, The", "Beatles The". All the examples in Craig Ringer's answer are the same length if you count only letters; you could find all those matches with a single pass through the database using Rabin-Karp

answered Jan 18, 2014 at 11:07

rossum

15.7k2 gold badges26 silver badges40 bronze badges

Collectives™ on Stack Overflow

Algorithm: Extract keyword from string by database

2 Answers 2

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related