5

In python, I am looking for python code which I can use to create random data matching any regex. For example, if the regex is

\d{1,100}

I want to have a list of random numbers with a random length between 1 and 100 (equally distributed)

There are some 'regex inverters' available (see here) which compute ALL possible matches, which is not what I want, and which is extremely impracticable. The example above, for example, has more then 10^100 possible matches, which never can be stored in a list. I just need a function to return a match by random.

Maybe there is a package already available which can be used to accomplish this? I need a function that creates a matching string for ANY regex, not just the given one or some other, but maybe 100 different regex. I just cannot code them myself, I want the function extract the pattern to return me a matching string.

8
  • Is \d{1,100} the real example? Could you paste some that you work with in real? If the regex is not too fuzzy, you can generate random string that is close to the matching expression and discard it if it does not match. Commented Jul 31, 2013 at 5:46
  • 1
    The example given is 'a' example. I do not know yet what I will need exactly. It can be much more complex like [A-C]{2}\d{2,20}@\w{10,1000}. Commented Jul 31, 2013 at 5:49
  • 1
    What sort of distribution do you want over the possible strings? For your example, for instance, would you want most of the random results to be 100 digits long (since 90% of the valid matching strings will have that many)? Or would you want each length to occur with equal chance? Commented Jul 31, 2013 at 5:58
  • The latter. Each length should occur with equal chances. Commented Jul 31, 2013 at 6:01
  • 1
    Well, you can't possibly solve this for all legal regex patterns, since anything that can match an infinite length string (like \d*) would have an infinite number of possible lengths to chose from. But if you limit the regex syntax that's allowed a bit, it's probably doable. Commented Jul 31, 2013 at 6:12

3 Answers 3

4

Two Python libraries can do this: sre-yield and Hypothesis.

  1. sre-yield

sre-yeld will generate all values matching a given regular expression. It uses SRE, Python's default regular expression engine.

For example,

import sre_yield
list(sre_yield.AllStrings('[a-z]oo$'))
['aoo', 'boo', 'coo', 'doo', 'eoo', 'foo', 'goo', 'hoo', 'ioo', 'joo', 'koo', 'loo', 'moo', 'noo', 'ooo', 'poo', 'qoo', 'roo', 'soo', 'too', 'uoo', 'voo', 'woo', 'xoo', 'yoo', 'zoo']

For decimal numbers,

list(sre_yield.AllStrings('\d{1,2}'))
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99']
  1. Hypothesis

The unit test library Hypothesis will generate random matching examples. It is also built using SRE.

import hypothesis
g=hypothesis.strategies.from_regex(r'^[A-Z][a-z]$')
g.example()

with output such as:

'Gssov', 'Lmsud', 'Ixnoy'

For decimal numbers

d=hypothesis.strategies.from_regex(r'^[0-9]{1,2}$')

will output one or two digit decimal numbers: 65, 7, 67 although not evenly distributed. Using \d yielded unprintable strings.

Note: use begin and end anchors to prevent extraneous characters.

Sign up to request clarification or add additional context in comments.

1 Comment

k=list(sre_yield.AllStrings('[a-zA-Z]\d{7}')) Is there way to limit the numbers as to generate 9 digit it would take lot of time.
2

If the expressions you match do not have any "advanced" features, like look-ahead or look-behind, then you can parse it yourself and build a proper generator

Treat each part of the regex as a function returning something (e.g., between 1 and 100 digits) and glue them together at the top:

import random
from string import digits, uppercase, letters

def joiner(*items):
    # actually should return lambda as the other functions
    return ''.join(item() for item in items)  

def roll(item, n1, n2=None):
    n2 = n2 or n1
    return lambda: ''.join(item() for _ in xrange(random.randint(n1, n2)))

def rand(collection):
    return lambda: random.choice(collection)

# this is a generator for /\d{1,10}:[A-Z]{5}/
print joiner(roll(rand(digits), 1, 10),
             rand(':'),
             roll(rand(uppercase), 5))

# [A-C]{2}\d{2,20}@\w{10,1000}
print joiner(roll(rand('ABC'), 2),
             roll(rand(digits), 2, 20),
             rand('@'),
             roll(rand(letters), 10, 1000))

Parsing the regex would be another question. So this solution is not universal, but maybe it's sufficient

2 Comments

This is pretty close to what I was going to say in my own answer. Yes, it can be done, but just parsing the regex pattern is enough of a challenge that I don't think any random SO answerer is going to produce working code.
@Blckknght: parsing a regex would be difficult but definitely doable, you can use regex for that :)
0

From this answer

You could try using python to call this perl module:

https://metacpan.org/module/String::Random

1 Comment

I would prefer a python only solution, as the perl solution requires (i) to fix a problem with an String/Random import (ii) to pass on parameters to the perl function (iii) to return the output and (iv) probably other issues concerning running this piece of code on different machines, Linux and Windows.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.