1

I have an output from the LLM call, which includes a string list. But the LLM is not smart enough to escape the double quote (") in the string, which will cause the json.loads error. How could I replace the double quote (") with (\") so that the json.loads can run without error.

Following is the example data, the "15" deep-dish cast aluminum wheels" will cause the error:

{
  "Equipment Contents": [
    "XL APPEARANCE GROUP",
    "body-color grille",
    "body-color body-side molding with chrome insert",
    "luggage rack",
    "15" deep-dish cast aluminum wheels",
    "P235/75R15SL all-terrain OWL tires",
    "power door locks",
    "power windows",
    "power mirrors",
    "cloth captains chairs"
  ]
}
2
  • 2
    Could you tell the LLM to escape the double-quotes? Or maybe tell it to replace them with the word "inch"? Commented Jul 4 at 14:48
  • 1
    I think if there was reliable way to detect "quotes that should be escaped" it would be baked in already existing solutions for JSONs. However, this throws error for a reason, because JSON is unreadable. And in your case it might seem obvious, but you won't find easy solution that would handle everything. You must correct your input data from LLM. Commented Jul 6 at 7:04

3 Answers 3

1

Update: Please disregard this solution because it is flawed. Thanks to @jhnc for giving me the sample that exposes the flaws. The errant unescaped quotes within the strings cannot be found like this and in my estimation, by no other method.

This might work. Uses PCRE to skip past valid syntax.
This can also work using Python regex engine "import regex" -
This must be installed first.

This is tested and works on your sample but that doesn't mean it will work
on all cases. Its unknown on all possible cases.

Unfortunately these are the only engines that would support this regex:
PCRE/Perl/Python regex/ECMAScript/C# (all Dot Net).
Otherwise a (Json) descent parser is needed.

(?:[\[{,:\]}]+[\[{,:\]}\s]*"(*SKIP)(*FAIL)|"(?!\s*[\[{,:\]}]))

replace \\$0

https://regex101.com/r/N13TgZ/1

(?:
   [\[{,:\]}]+ 
   [\[{,:\]}\s]* 
   "
   (*SKIP) (*FAIL) 
 | 
   "
   (?! \s* [\[{,:\]}] )
)

ECMAScript version

(?<![\[{,:\]}]+[\[{,:\]}\s]*)"(?!\s*[\[{,:\]}])

replace \\$&

https://regex101.com/r/ZdHaDw/1

These regex work on minimized json as well.
Minimized Source :

{"Equipment Contents":["XL APPEARANCE GROUP","body-color grille","body-color body-side molding with chrome insert","luggage rack","15" deep-dish cast aluminum wheels","P235/75R15SL all-terrain OWL tires","power door locks","power windows","p ower mirrors","cloth captains chairs"]}

PCRE/Perl: https://regex101.com/r/qSTp8S/1
ECMAScript: https://regex101.com/r/CG74kA/1

To use the Python 3rd party regex module, it has to be installed intp your Python setup.

I have a Windows box with Python 3.7 installed along with the regex module.
Use the pip install command pip install regex. I think it's here
https://pypi.org/project/regex but not sure.

This is what it runs like on my box:

>>> import regex
>>>
>>> pattern = regex.compile(r'''(?:[\[{,:\]}]+[\[{,:\]}\s]*"(*SKIP)(*FAIL)|"(?!\s*[\[{,:\]}]))''')
>>>
>>> data = r'''{
...   "Equipment Contents": [
...     "XL APPEARANCE GROUP",
...     "body-color grille",
...     "body-color body-side molding with chrome insert",
...     "luggage rack",
...     "15" deep-dish cast aluminum wheels",
...     "16" x 6.5" Chrome Wheels",
...     "P235/75R15SL all-terrain OWL tires",
...     "power door locks",
...     "power windows",
...     "power mirrors",
...     "cloth captains chairs"
...   ]
... }
... '''
>>>
>>> result_string = regex.sub(pattern, '\\"', data)
>>>
>>> print(result_string)
{
  "Equipment Contents": [
    "XL APPEARANCE GROUP",
    "body-color grille",
    "body-color body-side molding with chrome insert",
    "luggage rack",
    "15\" deep-dish cast aluminum wheels",
    "16\" x 6.5\" Chrome Wheels",
    "P235/75R15SL all-terrain OWL tires",
    "power door locks",
    "power windows",
    "power mirrors",
    "cloth captains chairs"
  ]
}

>>>
>>>
Sign up to request clarification or add additional context in comments.

4 Comments

Thanks. My code is under the python IDE. I'm not very familiar with the regex. Could you please write a python code, please?
Ok, added the Python demo.
is this problem generally solvable? ..., "15", 16", and 17" wheels", ...
Yeah so is the quote in error or is the comma in error? Depends on which side the approach is from. Either way throws off the other. In general then, none of this is solvable. Although the simplistic intentional design of json dictates a single string per line. However there is a reason json is valid when minimalized, which blows up everything. So I learned today. To the OP: At your own risk!
1

MAIN REGEX PATTERN Updated to allow quotes in the middle of the strings. (2025 July 16)

This code will replace every inner double quote, i.e. double quote between double quotes.

2-Step Approach: First, from the list, find the strings that contain (inner) double quotes, and then replace the double quotes in those strings. The string must begin and end with the double quote. The opening quote of the string is preceded by a comma, or by a opening angle bracket followed by whitespace. The closing quote of the string is followed by a comma, or by whitespace followed by a closing angle bracket.

REGEX PATTERN (Updated): Matches strings with inner quotes (python re module regex flavor):

(?<=\[|,)(?P<keep_whitespace>\s*\")(?P<fix>(?:[^\"]+)(?:\",?(?!\n)[^\",]*)+[^\"]*)(?=\"(?:,\n|\s*\]))

Regex demo (updated): https://regex101.com/r/NGvuwC/3

PYTHON (Updated):


import re

def replacement_function(match):
    # Escapes the double quotes

    group1 = match.group('keep_whitespace')
    group2 = match.group('fix')

    pattern = r'\"'
    replacement = r'\\"'
    group2_fixed = re.sub(pattern, replacement, group2)

    replacement_string = r'{}{}'.format(group1, group2_fixed)
    
    return replacement_string


def escape_inner_quotes(text): 
    pattern = r'(?<=\[|,)(?P<keep_whitespace>\s*\")(?P<fix>(?:[^\"]+)(?:\",?(?!\n)[^\",]*)+[^\"]*)(?=\"(?:,\n|\s*\]))'

    # Calls the replacement_function with the match as parameter (re feature)
    updated_text = re.sub(pattern, replacement_function, text)

    return updated_text

escape_inner_quotes(test_string)

TEST STRING: (Updated)


test_string = r'''
{
  "Equipment Contents": [
    "XL APPEARANCE GROUP",
    "body-color grille",
    "15" d"eep-"dish cast aluminum "wheels",
    "15" d"eep-"dish cast aluminum "wheels",
    "15" d"eep-"dish cast aluminum "wheels",
    "15", 16", and 17" wheels",
    "body-color body-side molding with chrome insert",
    "luggage rack",
    "15" deep-dish cast aluminum wheels",
    "P235/75R15SL all-terrain OWL tires",
    "power door locks",
    "power windows",
    "power mirrors",
    "cloth captains chairs",
    "15" d"eep-"dish cast aluminum wheels"
  ]
}


OUTPUT (Updated):

{
  "Equipment Contents": [
    "XL APPEARANCE GROUP",
    "body-color grille",
    "15\" d\"eep-\"dish cast aluminum \"wheels",
    "15\" d\"eep-\"dish cast aluminum \"wheels",
    "15\" d\"eep-\"dish cast aluminum \"wheels",
    "15\", 16\", and 17\" wheels",
    "body-color body-side molding with chrome insert",
    "luggage rack",
    "15\" deep-dish cast aluminum wheels",
    "P235/75R15SL all-terrain OWL tires",
    "power door locks",
    "power windows",
    "power mirrors",
    "cloth captains chairs",
    "15\" d\"eep-\"dish cast aluminum wheels"
  ]
}

REGEX NOTES:

  • (?<= Positive look behind, (?<=...). Matches if this index position is preceded by the following string. This will not consume any characters.
    • \[ Match literal [.
    • | or, (|).
    • , Match literal ,
  • )
  • (?P<keep_whitespace> Begin named group, (P<name>...). This group, 'keep_whitespace' matches and captures the whitespace after the opening bracket, or after a comma, before the opening quote, and followed by the opening quote. We need to keep this string as we will use it in the replacement string.
    • \s* Matches 0 or more (*) whitespace characters, including newline.
    • \" Match literal double quote, ". This is the opening quote.
  • )
  • (?P<fix> Begin named group, (P<name>...), called 'fix'. This group matches and captures a string that contains double quotes.
    • (?:
      • [^\"]+ Negated character class, [^...]. Matches any characters that is not a literal ", 1 or more times (+).
    • )
    • (?: Begin non-capturing group, (?:...).
      • \" Matches literal ".
      • ,? Matches 0 or 1 (?) literal commas, , that is not followed by a newline character:
      • (?!\n) Negative lookahead (?!...). Matches if not followed by a newline, (\n).
      • [^\",]* Negated character class, [^...]. Matches any characters that are not a literal " or a literal ,, 0 or more times (,). We do not want to match a comma here because that would allow us to step over the commas after the end quotes.
    • )+ Match 1 or more times (+).
    • [^\"]* Negated character class, [^...]. Matches any character that is not a literal ", 0 or more times (*).
  • )
  • (?= Positive lookahead, (?=...). Matches if this index position is followed by the following string. Will not consume any characters.
    • \" Match literal double quote, ". This is the closing quote.
    • (?: Begin non-capturing group, (?:...).
      • , Match literal , but only if it is followed by a newline:
      • \n Match newline, \n.
      • | or, (|).
      • \s* Matches 0 or more (*) whitespace characters, including newline.
      • \] Match literal ].
    • )
  • )

3 Comments

Fails here "15", 16", and 17" wheels", regex101.com/r/NGvuwC/1
Thank you. I have updated the pattern (and answer) to allow commas in the middle of the string items. I left them out in my original answer because I did not see commas in the original test strings in the question. It is however better to include the commas inside the strings just in case. (regex101.com/r/NGvuwC/2) :)
@sln Updated one (3 more) time. There was a space [ ] after the "cloth captains chairs", in the test string. I removed the space after that comma before the newline in the test_string and updated the output. The main regex pattern does not account for spaces after the comma before the newline. (If there is space after the comma before the newline, it should be added to the regex to allow for it.) :)
0

Try this only if you have unminimized JSON only with one odd quote


Try the following expression pattern:

import re

def repl(m):
    match = m.group(0)
    return match.replace(m.group(1), '\\"')

pattern = '(?<=").*?(").*?(?=")'

result_string = re.sub(pattern, repl, string)

Function repl accepts all matches (e.g. 15" deep-dish cast aluminum wheels), and returns the string with " replaced with \".

In turn, re.sub performs a replace with the given pattern.

Pattern explanation

  • (?<=") is a positive lookbehind to find the needed opening quote " before the string, but without including it;

  • .*? matches all characters (without line terminators);

  • (") matches " in a separate matching group, so it can be found and replaced in repl(m) function;

  • (?=") is a positive lookahead to check that after the needed string there is a closing quote, but without including it.

Example

Input: "15" deep-dish cast aluminum wheels".

Output: "15\" deep-dish cast aluminum wheels".

But it does not work if you have more than one quote:

Input: "15" d"eep-"dish cast aluminum wheels".

Output: "15\" d"eep-\"dish cast aluminum wheels".

4 Comments

Produces wrong output when there are multiple unescaped quotes to escape.
In the answer it is said "it does not work if you have more than one quote"
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.