Introduction:

I have the following scenario in PostgreSQL whereby I want to perform some data validation on a .csv string prior to inserting it into a table (see the fiddle here).

I've managed to get a regex (in a CHECK constraint) which disallows spaces within numbers (e.g. "12 34") and also disallows leading zeros ("00343").

Now, the icing on the cake would be if I could use regular expressions to disallow strings which contain a repeat of an integer - i.e. if a sequence \d+ matched another \d+ within the same string.

Is this beyond the capabilities of regular expressions?

My table is as follows:

CREATE TABLE test
(
  data TEXT NOT NULL,
  
  CONSTRAINT d_csv_only_ck 
    CHECK (data ~ '^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')

);

And I can populate it as follows:

INSERT INTO test VALUES 
('992,1005,1007,992,456,456,1008'),  -- want to make this line unacceptable - repeats!
('44,1005,1110'), 
('13,  44  ,  1005,  10078  '),  -- acceptable - spaces before and after integers   
('11,1203,6666'),
('1,11,99,2222'),
('3435'),             
('  1234    '); -- acceptable

But:

INSERT INTO test VALUES ('23432, 3433   ,00343, 567'); -- leading 0 - unacceptable

fails (as it should), and the following also fails (again, as it should):

INSERT INTO test VALUES ('12  34');  -- spaces within numbers - unacceptable
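
The same format rule can be tried outside the database. The following is only a sketch, assuming Python's re module treats this particular pattern the same way PostgreSQL does (it uses nothing beyond character classes, \d, groups and anchors); the helper name passes_check is made up for illustration.

```python
import re

# The pattern from the CHECK constraint above, copied unchanged.
CSV_INTS = re.compile(r'^([ ]*([1-9]\d*)+[ ]*)(,[ ]*([1-9]\d*)+[ ]*)*$')


def passes_check(data: str) -> bool:
    """Hypothetical helper: True if `data` satisfies the format rule."""
    return CSV_INTS.search(data) is not None


samples = [
    '992,1005,1007,992,456,456,1008',  # accepted, although it has repeats
    '13,  44  ,  1005,  10078  ',      # accepted: spaces around integers
    '  1234    ',                      # accepted
    '23432, 3433   ,00343, 567',       # rejected: leading zero
    '12  34',                          # rejected: spaces inside a number
]

for s in samples:
    print(f'{s!r}: {passes_check(s)}')
```

As the first sample shows, the format rule alone says nothing about repeated integers.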

The question:

However, if you look at the first string, it has repeats of 992 and 456.

I would like to be able to match these.

All of these rules do not have to be in the same regex - I can use a second CHECK constraint.

I would like to know if what I am asking is possible using Regular Expressions?

I did find this post which appears to go some (all?) of the way to solving my issue, but I'm afraid it's beyond my skillset to get it to work - I've included a small test at the bottom of the fiddle.

Please let me know should you require any further information.

p.s. as an aside, I'm not very experienced with regexes and I would welcome any input on my basic one above.

  • I think I understand what you are trying to do but I wonder why? Why are you adding the validation at the database layer? Could you possibly have it in the code that manages the database? What kind of data is it? Commented Aug 31, 2021 at 10:30
  • From the looks of it, this is simply a RegEx match problem. Commented Aug 31, 2021 at 10:31
  • @ErionOmeri - yes, it's a matching problem, but within the input string and not simply matching a given string literal that's already known with the input! Commented Aug 31, 2021 at 10:32
  • You might need to use regex “Look Ahead” to see if the string repeats. I am not sure if PostgreSQL supports that in its implementation. You could create a function to make things a lot easier, but I would do this in code, if at all. Commented Aug 31, 2021 at 10:38
  • PostgreSQL does have very sophisticated regex capabilities! Faleminderit for your input! Commented Aug 31, 2021 at 10:40

1 Answer

Since PostgreSQL regex does not support backreferences inside lookahead constraints, you cannot apply this restriction in this form, because you would need a negative lookahead with a backreference in it.

Have a look at this PCRE regex:

^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$

See this regex demo. Details:

  • ^ - start of string
  • (?!.*\b(\d+)\b.*\b\1\b) - a negative lookahead that fails the match if the same number occurs twice as a whole word anywhere in the string
  •  * (a space followed by *) - zero or more spaces
  • [1-9]\d* - a non-zero digit and then any zero or more digits
  •  * (a space followed by *) - zero or more spaces
  • (?:, *[1-9]\d* *)* - zero or more occurrences of
    • , * - comma and zero or more spaces
    • [1-9]\d* - a non-zero digit and then any zero or more digits
    •  * (a space followed by *) - zero or more spaces
  • $ - end of string.

Even if you replace \b with \y (PostgreSQL regex word boundaries) in the PostgreSQL code, it won't work due to the drawback mentioned at the top of the answer.
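
To see the pattern in action outside PostgreSQL, here is a minimal sketch using Python's re module. Python is an assumption of this example (the answer itself only targets PCRE), chosen because its engine also accepts a backreference inside a lookahead; the helper name is_valid is made up for illustration.

```python
import re

# The PCRE pattern from the answer, copied unchanged.
NO_REPEATS = re.compile(
    r'^(?!.*\b(\d+)\b.*\b\1\b) *[1-9]\d* *(?:, *[1-9]\d* *)*$'
)


def is_valid(data: str) -> bool:
    """Hypothetical helper: True if `data` is a well-formed list of
    integers with no whole number occurring twice."""
    return NO_REPEATS.search(data) is not None


samples = [
    '992,1005,1007,992,456,456,1008',  # repeats of 992 and 456
    '456, 897, 456',                   # repeat of 456
    '1,11,99,2222',                    # 1 and 11 are different numbers
    '13,  44  ,  1005,  10078  ',      # spaces around integers
    'asfdasdf adsfas ,adfas',          # no numbers at all
]

for s in samples:
    print(f'{s!r}: {is_valid(s)}')
```

The word boundaries are what stop 1 from being treated as a repeat inside 11: the captured group has to be a complete number, and so does the later occurrence it is compared with.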

9 Comments

Thanks for your input! I already upvoted your contribution which I reference in my question, so I hope you won't mind a direct question! If you look here, your regex does what I want with INTs, it seems to "get confused" with alphabetical strings for some reason? Could you also provide a brief breakdown of what the bits do? Ideally, I'd like to know "how to fish..."!
@Vérace I do not quite understand what confusion you noticed here. What do you mean?
456, 897, 456 - is in white - a match on 456? asfdasdf adsfas ,adfas also in white - no match there?
@Vérace 456, 897, 456 should not match, there are two repeating numbers. asfdasdf adsfas ,adfas has no numbers and should not match.
Ah... OK, I was "inverting" them - if I want to reverse the "polarity" - i.e. make the white ones coloured and the coloured ones white - what do I have to do? (final question - then I'll upvote and mark as correct - and dziękuję). Trying to understand!... :-)