Given the following json string: {"key":"val"ue","other":"invalid ""quo"te"}

I want to capture each illegal double quote inside the values. In the example there is one double quote in the value of the key property and there are three double quotes in the property called other.

I've seen multiple comments noting that this is invalid json (correct) and that the supplied json should be valid before receiving. However this is not possible in my case.

Assuming that this would only occur in the values and not in keys I think it's safe to assume that a starting sequence would be a colon followed by a double quote. An ending sequence would be a double quote followed by comma OR closing curly brace.

I've tried the following regex (among many other versions), which is the closest to my desired solution:

/:\s?".*?(").*?[,}]/i

This correctly captures the one double quote in the key property, but only captures the first double quote in the 'other' property. I would like it to capture the other two double quotes as well, each as a separate capture.

Another regex I've tried is /:\s?".*?("{1,})[^,}].*?[,}]/i, which does the same as the first regex but captures the two adjacent double quotes as a single capture (not preferable).

My goal ultimately is to capture each double quote separately, so four captures. What I think I need in order to accomplish this is a way to make the capture group 'greedy?' so that it doesn't stop at the first double quote.

How could I achieve this?

I am using the following PHP code to test the Regex:

$text = '{"key":"val"ue","other":"invalid ""quo"te"}';
$pattern = '/:\s?".*?(").*?[,}]/i';
preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
echo '<pre>' . print_r($matches, true) . '</pre>';
  • You say you can't get the source JSON fixed, but that's absurd. Anything producing JSON this broken is probably going to produce other incorrect output too. Don't trust it. Commented Sep 23, 2023 at 22:04
  • Nevertheless, getting the source json fixed is not the point. Going into details is a waste of time due to the nature of the feature that allows this. Commented Sep 23, 2023 at 22:46
  • Is your goal to just capture the illegal quotes, or to try and make the JSON valid? Commented Sep 24, 2023 at 0:03
  • Are these (regex101 demo) the invalid quotes? Commented Sep 24, 2023 at 10:07
  • @SomewhatBeginner If you change } to [:}] at first glance it looks like it will cover this, not sure (updated regex). Commented Sep 24, 2023 at 12:02

2 Answers

What you could do is to use a variant of The Trick...

The trick is that we match what we don't want on the left side of the alternation (the |), then we capture what we do want on the right side.
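
As a minimal sketch of the trick in its plain form (a capture group on the right, no PCRE verbs), the following uses the sample string from the question. The pattern here is a simplified assumption for illustration: it leaves out escaped-quote handling, and the variable name $offsets is hypothetical.

```php
<?php
// Sample text from the question.
$text = '{"key":"val"ue","other":"invalid ""quo"te"}';

// Left of the last |: structural quotes we do NOT want (matched, never captured).
// Right of the last |: a capture group for the quotes we DO want.
// Assumption: escaped quotes (\") do not occur; this sketch does not handle them.
$pattern = '/(?:"\s*[:,]|\{)\s*"|"\s*[:}]|(")/';

preg_match_all($pattern, $text, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$offsets = [];
foreach ($matches as $m) {
    // Group 1 only participates when the right-hand side matched.
    if (isset($m[1]) && $m[1][1] >= 0) {
        $offsets[] = $m[1][1];
    }
}

print_r($offsets);
```

For the sample string this should collect the offsets 11, 33, 34 and 38. Matches produced by the left-hand alternatives are simply ignored, because group 1 is not set for them.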

The good thing about PCRE is that it provides backtracking control verbs, (*SKIP)(*F), which simply skip whatever the left side matched, so no capture group is needed.

(?:(?:"\s*[:,]|\{)\s*"|\\"|"\s*[:}])(*SKIP)(*F)|"

See this demo at regex101

The alternatives to the left of (*SKIP)(*F) match all the "correct" quotes (regex101), which are then skipped. Any remaining quotes are matched individually by the right side, |".

Finally, you can pass the PREG_OFFSET_CAPTURE flag to get the position of each "illegal quote".
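
A short sketch of how this might look in PHP, using the test string from the question. One detail that is easy to miss: inside a single-quoted PHP string the regex token \\" has to be written with four backslashes, otherwise PHP collapses it to \" and the left side would match every quote. The variable names are only illustrative.

```php
<?php
$text = '{"key":"val"ue","other":"invalid ""quo"te"}';

// Four backslashes in the PHP source become two in the pattern, which is
// what the regex needs to match one literal backslash before a quote.
$pattern = '/(?:(?:"\s*[:,]|\{)\s*"|\\\\"|"\s*[:}])(*SKIP)(*F)|"/';

preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

// Each entry of $matches[0] is [matched text, byte offset].
$offsets = array_column($matches[0], 1);

print_r($offsets);
```

For the sample string this should yield the offsets 11, 33, 34 and 38, one entry per illegal quote.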

6 Comments

For the sake of skipping double quotes that were escaped I edited the above regex to exclude escaped double quotes. /(?:(?:"\s*[:,]|\{)\s*"|\\"|"\s*[:}])(*SKIP)(*F)|"/i
@SomewhatBeginner You bet I was already thinking about escaped quotes, but decided not to do more than necessary. :D Yes, you got the idea, so you can fine-tune it to your needs, and The Trick is a good thing to know anyway.
@SomewhatBeginner I updated the answer according to your modification. Regarding @mickmackusa's comment, I also removed the i flag from the regex101 demos. No idea how it got there! :)
Somehow I can't see their comment, but is there a performance loss if the i-flag is used when it's not needed?
@SomewhatBeginner I guess not! Just forgot to remove it, I was doing some other regex there before.

I wouldn't use regex for this. I would just manually scan the string:

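function detectIllegals($text)
{
    $illegals = [];
    $insideString = false;
    $len = strlen($text);
    for($i=0;$i<$len;$i++)
    {
        $c = $text[$i];
        if($c=='"')
        {
            if($insideString)
            {
                // A quote closes the string only if ':', ',' or '}' follows it.
                // The ?? '' avoids reading past the end of the text.
                $c2 = $text[$i+1] ?? '';
                if($c2==':' || $c2==',' || $c2=='}')
                    $insideString = false;
                else
                    $illegals[] = $i; // any other quote inside a string is illegal
            }
            else
                $insideString = true; // opening quote of a key or value
        }
    }
    return $illegals;
}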

$text = '{"key":"val"ue","other":"invalid ""quo"te"}';
$a = detectIllegals($text);
print_r($a);

Output:

Array
(
    [0] => 11
    [1] => 33
    [2] => 34
    [3] => 38
)
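
If the goal is eventually to repair the string rather than only locate the quotes, the offsets could be used along these lines. This is only a sketch under the assumption that escaping each illegal quote with a backslash is the desired fix; the offsets are hard-coded from the output above and $fixed is a hypothetical name.

```php
<?php
$text = '{"key":"val"ue","other":"invalid ""quo"te"}';
$offsets = [11, 33, 34, 38]; // positions reported in the output above

// Work from the last offset to the first so that earlier positions stay
// valid while backslashes are inserted.
$fixed = $text;
foreach (array_reverse($offsets) as $offset) {
    $fixed = substr_replace($fixed, '\\"', $offset, 1);
}

var_dump(json_decode($fixed, true));
```

After the replacements, json_decode should be able to parse the string, with the formerly illegal quotes preserved inside the values.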
