
I have a hosts file which is in the following format:

# comments

(ipv4/ipv6 address) (multiple hostnames)
.
.
.

I need to convert them to an optimised regular expression using bash/sed/awk. For example, if we have the following in the hosts file:

127.0.0.1 abc.example.com def.example.com
127.0.0.1 ghi-example.com foobar.com
127.0.0.1 malwaredomain.com malware-domain.com

to be converted as:

(((abc|def)\.|ghi-)example\.com|foobar\.com|malware-?domain\.com)

It may be preferable to also have some intelligent conversion. For example, if we have lots of similar entries like:

127.0.0.1 ad-us.adserver.com ad-uk.adserver.com ad-fr.adserver.com ad-de.adserver.com
127.0.0.1 ad-ru.adserver.com ad-ca.adserver.com ad-se.adserver.com ad-be.adserver.com
...

They could be converted to ad-.*\.adserver\.com, or maybe even ad-.{2}\.adserver\.com. Of course something like ad-(us|uk|fr|de|ru|ca|se|be)\.adserver\.com works, but I'd prefer a generic rule, since it has the additional benefit of catching servers that may be added later.

EDIT: Summarising, if I have a hosts file like this:

127.0.0.1 atmdt.com foo.atmdt.com bar.atmdt.com
127.0.0.1 anifkalood.ru boeing-job.com ilianorkin.ru humaniopa.ru
127.0.0.1 hillairusbomges.ru mgithessia.biz justintvfreefall.org

The output will be a regex which covers all the servers above:

(((((foo|bar)\.)?atmdt|boeing-job)\.com)|(anifkalood|hillairusbomges|ilianorkin|humaniopa)\.ru|mgithessia\.biz|justintvfreefall\.org)

How can I achieve this?

Thanks in advance.

  • The Perl module search.cpan.org/~manu/Net-IP-1.26/IP.pm might be of interest Commented Mar 28, 2013 at 11:26
  • The problem is in defining the limits of what should match/should not match. After all .* would meet your requirements for a general rule, since it will match any entry! (and you could consider that optimised) Commented Mar 28, 2013 at 11:30
  • An implementation of this that looked like what you wanted (optimizing) would typically be done by building a tree. Bash (prior to the unreleased 4.3, which adds namerefs from ksh) doesn't support pointers or references, which are necessary for trees, so the facilities necessary for a sane and reasonable implementation are not present. Ignoring the shortest-possible condition, you could simply convert the . instances (or, ideally, any characters not explicitly whitelisted as safe) to [.], add a ( and ) at the beginning and end and separate by |, but, well, that's not so interesting. Commented Mar 28, 2013 at 11:40
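
For what it's worth, here is a rough sketch of both ideas from the last comment, written in Python rather than bash since bash lacks the data structures for it. It is only an illustration, not a tested tool: the function names (parse_hosts, naive_regex, build_trie, trie_to_regex) and the HOSTS sample are made up for this example. The tree is built over reversed hostnames so that shared suffixes such as \.com are factored out; it does not try to generalise to rules like ad-.{2}.

```python
import re

# Sample input taken from the question; the comment line is added for illustration.
HOSTS = """\
# comments
127.0.0.1 atmdt.com foo.atmdt.com bar.admdt.com
127.0.0.1 anifkalood.ru boeing-job.com ilianorkin.ru humaniopa.ru
127.0.0.1 hillairusbomges.ru mgithessia.biz justintvfreefall.org
"""

def parse_hosts(text):
    """Collect hostnames, skipping comments and the address column."""
    names = set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        names.update(fields[1:])  # fields[0] is the IPv4/IPv6 address
    return sorted(names)

def naive_regex(names):
    """Baseline: escape every name and join the lot with '|'."""
    return "(" + "|".join(re.escape(n) for n in names) + ")"

def build_trie(names):
    """Character trie over the reversed names, so common suffixes share a path."""
    root = {}
    for name in names:
        node = root
        for ch in reversed(name):
            node = node.setdefault(ch, {})
        node[""] = {}  # marker: a complete hostname ends here
    return root

def trie_to_regex(node):
    """Turn a reversed-name trie back into a left-to-right regex."""
    ends_here = "" in node
    branches = [trie_to_regex(node[ch]) + re.escape(ch)
                for ch in sorted(k for k in node if k)]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a shorter hostname also ends here, the longer prefixes are optional.
    return group + "?" if ends_here else group

names = parse_hosts(HOSTS)
print(naive_regex(names))
print(trie_to_regex(build_trie(names)))
```

The result is meant to be used anchored, for example with re.fullmatch or wrapped in ^...$. Factoring prefixes as well as suffixes would need a further pass over the tree, which this sketch leaves out.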

2 Answers


You seem to be looking for a regex generator. Here are some:

I would recommend the genetic approach, though I'm not sure how well these tools optimize their output.


1 Comment

Yes, I chose the regex generator, but I'll choose this one instead of the others you posted: search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/…

This looks more like a Computer Science project than a simple programming question!

I don't think you'll find any straightforward bash/sed/awk instructions to do this. You want to create regular expressions programmatically, whereas sed and awk are better suited to applying regexes than to generating them. I guess you'd have to look into approximate string matching and, specifically, computing the Levenshtein distance between two strings.
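
To give an idea of what that involves, here is a minimal sketch of the Levenshtein distance in Python (the usual dynamic-programming version, keeping one row at a time). The function name and the sample hostnames are just for illustration; the hostnames come from the question. Names that are a small distance apart are candidates for merging into one pattern, though deciding how to merge them is a separate problem.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

# Hostnames from the question that differ by a single edit.
print(levenshtein("malwaredomain.com", "malware-domain.com"))
print(levenshtein("ad-us.adserver.com", "ad-uk.adserver.com"))
```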
