Generating a regex from a hosts file using bash/sed/awk

Question

I have a hosts file which is in the following format:

# comments

(ipv4/ipv6 address) (multiple hostnames)
.
.
.

I need to convert them to an optimised regular expression using bash/sed/awk. For example, if we have the following in the hosts file:

127.0.0.1 abc.example.com def.example.com
127.0.0.1 ghi-example.com foobar.com
127.0.0.1 malwaredomain.com malware-domain.com

it should be converted to:

(((abc|def)\.|ghi-)example\.com|foobar\.com|malware-?domain\.com)

It may be preferable to also have some intelligent conversion. For example, if we have lots of similar entries like:

127.0.0.1 ad-us.adserver.com ad-uk.adserver.com ad-fr.adserver.com ad-de.adserver.com
127.0.0.1 ad-ru.adserver.com ad-ca.adserver.com ad-se.adserver.com ad-be.adserver.com
...

They may be converted to ad-.*\.adserver\.com, or maybe even to ad-.{2}\.adserver\.com. Of course something like ad-(us|uk|fr|de|ru|ca|se|be)\.adserver\.com works, but I'd prefer a generic rule, since it has the additional benefit of detecting servers that may be added later.
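The difference between the explicit and the generic rules can be checked directly. The following is a minimal sketch in Python rather than bash/sed/awk; the patterns are written with `ad-` to match the hostnames listed above, and `ad-nl.adserver.com` is a made-up host standing in for a server added later.

```python
import re

# Hostnames taken from the example above.
known = [
    "ad-us.adserver.com", "ad-uk.adserver.com", "ad-fr.adserver.com",
    "ad-de.adserver.com", "ad-ru.adserver.com", "ad-ca.adserver.com",
    "ad-se.adserver.com", "ad-be.adserver.com",
]

# Hypothetical host that might be added later; it is not in the original list.
later = "ad-nl.adserver.com"

patterns = {
    "explicit": r"ad-(us|uk|fr|de|ru|ca|se|be)\.adserver\.com",
    "two-char": r"ad-.{2}\.adserver\.com",
    "wildcard": r"ad-.*\.adserver\.com",
}

for name, pattern in patterns.items():
    rx = re.compile(pattern)
    covers_known = all(rx.fullmatch(h) for h in known)
    covers_later = rx.fullmatch(later) is not None
    print(f"{name}: known={covers_known} later={covers_later}")
```

All three patterns cover the known hosts, but only the two generic ones also cover the later host.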

EDIT: Summarising, if I have a hosts file like this:

127.0.0.1 atmdt.com foo.atmdt.com bar.atmdt.com
127.0.0.1 anifkalood.ru boeing-job.com ilianorkin.ru humaniopa.ru
127.0.0.1 hillairusbomges.ru mgithessia.biz justintvfreefall.org

The output should be a regex which covers all the servers above:

((((foo|bar)\.)?atmdt|boeing-job)\.com|(anifkalood|hillairusbomges|ilianorkin|humaniopa)\.ru|mgithessia\.biz|justintvfreefall\.org)

How can I achieve this?

Thanks in advance.

The Perl module search.cpan.org/~manu/Net-IP-1.26/IP.pm might be of interest
– cdarke, Commented Mar 28, 2013 at 11:26
The problem is in defining the limits of what should match and what should not. After all, .* would meet your requirements for a general rule, since it will match any entry (and you could consider that optimised)!
– cdarke, Commented Mar 28, 2013 at 11:30
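One way to make those limits concrete is to test a candidate regex against two lists: names it must match and names it must not. This is a minimal Python sketch; the helper `check` and the `allowed` hosts are illustrative assumptions, not something from the question.

```python
import re

def check(pattern, must_match, must_not_match):
    """Return (missed, false_positives) for pattern under full-string matching."""
    rx = re.compile(pattern)
    missed = [h for h in must_match if not rx.fullmatch(h)]
    false_positives = [h for h in must_not_match if rx.fullmatch(h)]
    return missed, false_positives

# Hosts from the question that the regex should cover.
blocked = ["foobar.com", "malwaredomain.com", "malware-domain.com"]
# Hypothetical hosts that must stay unmatched.
allowed = ["example.org", "foobar.com.example.net"]

print(check(r".*", blocked, allowed))
print(check(r"foobar\.com|malware-?domain\.com", blocked, allowed))
```

The `.*` rule misses nothing but reports every allowed host as a false positive, while the specific rule returns two empty lists.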
An optimizing implementation of the kind you describe would typically be done by building a tree. Bash (prior to the unreleased 4.3, which adds namerefs from ksh) doesn't support pointers or references, which are necessary for trees, so the facilities needed for a sane and reasonable implementation are not present. Ignoring the shortest-possible condition, you could simply convert the . instances (or, ideally, any characters not explicitly whitelisted as safe) to [.], add a ( and ) at the beginning and end, and separate the entries by |, but, well, that's not so interesting.
– Charles Duffy, Commented Mar 28, 2013 at 11:40
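The unoptimized conversion described in that comment can be sketched as follows. This is Python rather than bash, and the function names and the whitelist of safe characters (letters, digits, hyphen) are assumptions made for illustration.

```python
import re
import string

# Assumed whitelist: characters that need no quoting in a hostname.
SAFE = set(string.ascii_letters + string.digits + "-")

def quote(name):
    """Turn '.' into '[.]' and escape any other character outside the whitelist."""
    return "".join(
        c if c in SAFE else ("[.]" if c == "." else re.escape(c))
        for c in name
    )

def hosts_to_alternation(text):
    """Join every hostname in hosts-file text into one (a|b|c) pattern."""
    names, seen = [], set()
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()   # drop comments, split columns
        for name in fields[1:]:                  # fields[0] is the address
            if name not in seen:
                seen.add(name)
                names.append(name)
    return "(" + "|".join(quote(n) for n in names) + ")"

sample = "# comment\n127.0.0.1 abc.example.com foobar.com\n::1 foobar.com\n"
print(hosts_to_alternation(sample))
# prints (abc[.]example[.]com|foobar[.]com)
```

The result is correct but grows linearly with the hosts file, which is why it does not satisfy the "optimised" requirement.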

Édouard Lopez · Accepted Answer · 2013-03-29 18:56:46Z

2

You seem to be looking for a regex generator. Here are some:

Regexp::List - builds regular expressions out of a list of words (Perl) - chosen by the asker, @user2064000
Automatic Generation of Regular Expressions from Examples (Java/JavaScript syntax)
JavaScript Regex Generator (beta) (JavaScript syntax)
knowing is obsolete :: regular expression generator (not sure about the syntax: Perl, PHP, Python, Java, JavaScript, ColdFusion, C, C++, Ruby, VB, VBScript, J#, C#, C++.NET, VB.NET)
Re-Gen - a Regex Generator (Python syntax)

I would recommend the genetic approach, but I'm not sure how well these tools optimise the result.
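To give a rough idea of what a list-to-regex generator has to do, here is a minimal Python sketch of the tree-based construction mentioned in the comments on the question. It is not necessarily how any of the tools listed above work. It groups hostnames by DNS label from right to left, so shared suffixes such as `.com` are factored out. All function names are made up for this example, and the code assumes well-formed hostnames with no empty labels.

```python
import re

END = ""  # marker key: a complete hostname ends at this node

def build_tree(names):
    """Nest dicts by DNS label, read right to left (TLD first)."""
    root = {}
    for name in names:
        node = root
        for label in reversed(name.split(".")):
            node = node.setdefault(label, {})
        node[END] = {}
    return root

def render(node):
    """Regex for every name stored below node, excluding node's own label."""
    alternatives = []
    for label, child in sorted(node.items()):
        if label == END:
            continue
        alternatives.append(render_prefix(child) + re.escape(label))
    if len(alternatives) == 1:
        return alternatives[0]
    return "(?:" + "|".join(alternatives) + ")"

def render_prefix(child):
    """Regex for whatever may precede child's label, including the dot."""
    if not any(label != END for label in child):
        return ""                        # nothing to the left of this label
    inner = render(child) + r"\."
    # If a name also ends at this label, the part to the left is optional.
    return "(?:" + inner + ")?" if END in child else inner

def hosts_to_regex(names):
    names = list(names)
    if not names:
        raise ValueError("need at least one hostname")
    return render(build_tree(names))

# A subset of the hostnames from the question's summary example.
names = ["atmdt.com", "foo.atmdt.com", "boeing-job.com",
         "anifkalood.ru", "ilianorkin.ru", "mgithessia.biz"]
pattern = re.compile(hosts_to_regex(names))
print(pattern.pattern)
print(all(pattern.fullmatch(n) for n in names))
```

The sketch only factors out whole labels, so it will not produce character-level shortcuts such as `malware-?domain`, and it never generalises beyond the names it was given.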

edited Mar 29, 2013 at 18:56

answered Mar 28, 2013 at 11:54


1 Comment

user2064000 Over a year ago

Yes, I chose the regex generator, but I'll choose this one instead of the others you posted: search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/…

Miklos Aubert · Answer · 2013-03-28 11:36:03Z

0

This looks more like a Computer Science project than a simple programming question!

I don't think you'll find any straightforward bash/sed/awk instructions to do this. You want to create regular expressions programmatically, and sed and awk are better suited to applying regexes than to generating them. I guess you'd have to look into approximate string matching and, specifically, computing the Levenshtein distance between two strings.
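For reference, the Levenshtein distance can be computed with the standard dynamic-programming recurrence. This is a minimal Python sketch that only measures how similar two hostnames are; turning such distances into a regex is a separate step that it does not attempt.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("malwaredomain.com", "malware-domain.com"))  # prints 1
```

A distance of 1 between two entries hints that they could share one pattern with an optional character, as in `malware-?domain\.com`.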

answered Mar 28, 2013 at 11:36
