Efficient data structure to hold strings with wildcards

Question

This question is almost the opposite of "Efficient data structure for word lookup with wildcards".

Suppose we have a database of URLs:

http://aaa.com/
http://bbb.com/
http://ccc.com/
....

To check whether a URL is on the list, I can do a binary search over the sorted list and get the result in O(log n) time, where n is the size of the list.
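For reference, the exact-match lookup described above could look roughly like this in Python. This is only a minimal sketch: the helper name `contains` is made up for illustration, and the list is assumed to be kept sorted.

```python
from bisect import bisect_left

def contains(sorted_urls, url):
    """Return True if url is present in sorted_urls (sorted ascending)."""
    # bisect_left finds the leftmost position where url could be inserted.
    i = bisect_left(sorted_urls, url)
    return i < len(sorted_urls) and sorted_urls[i] == url

urls = sorted(["http://aaa.com/", "http://bbb.com/", "http://ccc.com/"])
print(contains(urls, "http://bbb.com/"))
print(contains(urls, "http://ddd.com/"))
```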

This structure served well for many years but now I'd like to have wildcards in the database entries, like:

http://*aaa.com/*
http://*bbb.com/*
http://*ccc.com/
....

The naive search would be a full scan of the entries, taking O(n) time per lookup.

Which data structure would support lookups in less than O(n) time?
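To make the O(n) baseline concrete, a full scan could be sketched as below. It assumes `*` is the only wildcard that appears in the entries (the standard `fnmatch` module also treats `?` and `[...]` specially), and `matches_any` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

patterns = ["http://*aaa.com/*", "http://*bbb.com/*", "http://*ccc.com/"]

def matches_any(patterns, url):
    """Test url against every wildcard pattern in turn: O(n) patterns per query."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)

print(matches_any(patterns, "http://testaaa.com/"))
print(matches_any(patterns, "http://ddd.com/"))
```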

you could still do binary search, but maintain the sorted list of known urls with the strings stored in reverse (starting from the end)
– advocateofnone, Commented Dec 23, 2014 at 18:01
is http://sasccc.com a valid query, i.e. without a dot separator?
– advocateofnone, Commented Dec 23, 2014 at 18:23
could you split the urls into a fixed number of fields, where a field could be wild or specified? or do you need wildcards to be able to appear anywhere in the url (e.g. http*://*ca*.c/*/*.html)?
– ryanpattison, Commented Dec 23, 2014 at 18:38
possible duplicate of "Efficient data structure for word lookup with wildcards"
– 500 - Internal Server Error, Commented Dec 23, 2014 at 19:03

Ricbit · Accepted Answer · 2014-12-23 18:51:40Z

2

If all the URLs are known beforehand, you could build a finite automaton, which will answer each query in O(url length) time.

This finite automaton can be built as a regexp:

http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$

Here's some Python code. After re.compile(), each query is very fast.

import re

# Raw string, so "\." reaches the regex engine as an escaped (literal) dot.
urls = re.compile(r"http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$")

# match() anchors at the start of the string; "$" anchors at the end.
print(urls.match("http://testaaa.com/") is not None)
> True
print(urls.match("http://somethingbbb.com/dir") is not None)
> True
print(urls.match("http://ccc.com/") is not None)
> True
print(urls.match("http://testccc.com/") is not None)
> True
print(urls.match("http://testccc.com/ddd") is not None)
> False
print(urls.match("http://ddd.com/") is not None)
> False
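
The alternation above is written by hand. For a longer list it could be generated from the wildcard entries themselves. The following is a sketch under the assumption that `*` is the only wildcard and means "any run of characters"; `compile_patterns` is a hypothetical helper, not part of the `re` module.

```python
import re

def compile_patterns(patterns):
    """Combine wildcard patterns ('*' = any run of characters) into one regex."""
    alternatives = []
    for pattern in patterns:
        # Escape each literal piece, then rejoin the pieces with '.*'.
        pieces = [re.escape(piece) for piece in pattern.split("*")]
        alternatives.append(".*".join(pieces))
    # \Z anchors the end of the string; match() anchors the start.
    return re.compile("(?:" + "|".join(alternatives) + r")\Z")

urls = compile_patterns(["http://*aaa.com/*", "http://*bbb.com/*", "http://*ccc.com/"])
print(urls.match("http://testaaa.com/") is not None)
print(urls.match("http://testccc.com/ddd") is not None)
```

Escaping the literal pieces with re.escape() keeps characters such as "." from being read as regex operators.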

edited Dec 23, 2014 at 18:51

answered Dec 23, 2014 at 18:31


2 Comments

ppaulojr Over a year ago

I guess you cannot re.compile a very large string :)

Ricbit Over a year ago

If the regexp implementation is not up to the task, you can always build the automaton yourself. This will provide you better control over how much memory is used.
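
One way to read "build the automaton yourself" is to simulate a nondeterministic automaton over all patterns at once. The sketch below is an illustration under assumptions, not necessarily what the commenter had in mind: `*` is the only wildcard, and the names `closure` and `matches` are made up. A state (p, i) means "the first i characters of patterns[p] have been matched".

```python
def closure(patterns, states):
    """Add the states reachable by letting a '*' match the empty string."""
    result = set()
    stack = list(states)
    while stack:
        p, i = stack.pop()
        if (p, i) in result:
            continue
        result.add((p, i))
        if i < len(patterns[p]) and patterns[p][i] == "*":
            stack.append((p, i + 1))
    return result

def matches(patterns, url):
    """Return True if url matches at least one wildcard pattern."""
    states = closure(patterns, {(p, 0) for p in range(len(patterns))})
    for ch in url:
        next_states = set()
        for p, i in states:
            if i == len(patterns[p]):
                continue  # pattern is exhausted but input remains
            if patterns[p][i] == "*":
                next_states.add((p, i))      # '*' consumes ch and stays put
            elif patterns[p][i] == ch:
                next_states.add((p, i + 1))  # literal character matched
        states = closure(patterns, next_states)
        if not states:
            return False  # no pattern can match any more
    return any(i == len(patterns[p]) for p, i in states)

patterns = ["http://*aaa.com/*", "http://*bbb.com/*", "http://*ccc.com/"]
print(matches(patterns, "http://testccc.com/"))
print(matches(patterns, "http://testccc.com/ddd"))
```

Each query here costs O(url length × number of live states). Converting the automaton to a deterministic one would give the O(url length) bound from the answer, at the price of potentially many more states in memory.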
