2

This question is almost the opposite of Efficient data structure for word lookup with wildcards

Suppose we have a database of urls

http://aaa.com/
http://bbb.com/
http://ccc.com/
....

To find if a url is on the list I can make a binary-search and get the results in O(log n) time, n the size of the list.

This structure served well for many years but now I'd like to have wildcards in the database entries, like:

http://*aaa.com/*
http://*bbb.com/*
http://*ccc.com/
....

And the naive search would result in a full scan with O(n) time for finding.

Which data structure could have find in less than O(n)?

7
  • you could still do binary search, but maintain the sorted lists of know urls with strings starting from behind Commented Dec 23, 2014 at 18:01
  • the query url: http://test.ccc.com/ result true Commented Dec 23, 2014 at 18:06
  • is http ://sasccc.com a valid query ie without a dot separator ? Commented Dec 23, 2014 at 18:23
  • could you split the urls into a fixed number of fields, where a field could be wild or specified? or do you need wild cards to be able to appear anywhere in the url (e.g http*://*ca*.c/*/*.html)? Commented Dec 23, 2014 at 18:38
  • possible duplicate of Efficient data structure for word lookup with wildcards Commented Dec 23, 2014 at 19:03

1 Answer 1

2

If the all the urls are known beforehand, then you could just build a finite automaton, which will solve your problem with queries in O(url length).

This finite automaton can be built as a regexp:

http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$

Here's some python code. After re.compile(), each query is very fast.

import re

urls = re.compile("http://(.*aaa\.com/.*|.*bbb\.com/.*|.*ccc\.com/)$")

print urls.match("http://testaaa.com/") is not None
> True
print urls.match("http://somethingbbb.com/dir") is not None
> True
print urls.match("http://ccc.com/") is not None
> True
print urls.match("http://testccc.com/") is not None
> True
print urls.match("http://testccc.com/ddd") is not None
> False
print urls.match("http://ddd.com/") is not None
> False
Sign up to request clarification or add additional context in comments.

2 Comments

I guess you cannot re.compile a very large string :)
If the regexp implementation is not up to the task, you can always build the automaton yourself. This will provide you better control over how much memory is used.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.