Using REGEX to Handle Nested Double Quotes in JSON Strings in Python

Question

I'm using a generative AI API to return text responses as JSON strings, which I intend to feed into an application in real time. The problem is that the JSON returned by the API often includes small errors, most commonly with double quotes. These syntax issues trigger errors in my Python code when I parse the string as JSON.

For instance, I have the following JSON string:
'{"test":"this is "test" of "a" test"","result":"your result is "out" in our website"}'

As you can see, the value for "test" contains multiple unescaped double quotes, so I get an error if I try to parse it as JSON. What I want to do is use regex to convert the inner double quotes to single quotes, so that the result looks like this:
'{"test":"this is 'test' of 'a' test'", "result": "your result is 'out' in our website"}'

The best I can do is as follows:

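import re

text = '{"test":"this is "test" of "a" test"","result":"your result is "out" in our website"}'

def repl_call(m):
    preq = m.group(1)   # delimiter (and any whitespace) before the opening quote
    qbody = m.group(2)  # text between the outer quotes
    qbody = re.sub(r'"', "'", qbody)  # turn inner double quotes into single quotes
    return preq + '"' + qbody + '"'

print(re.sub(r'([:\[,{]\s*)"(.*?)"(?=\s*[:,\]}])', repl_call, text))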

The code above returns the intended result. However, if I add a comma inside the value, such as
{"test":"this is "test" of "a", test"","result":"your result is "out" in our website"}

...the code breaks and returns the following:
'{"test":"this is 'test' of 'a", test"","result":"your result is 'out' in our website"}'

:(

I have tried improving my AI prompt (prompt engineering) to avoid the double quotes and return only valid JSON. This works to some degree, but I still encounter enough syntax errors that I have to retry the same prompt multiple times, which incurs unnecessary delays and costs.

My question is: is there a common function or regex pattern I can apply in Python to fix my JSON string so that syntax errors, specifically misplaced double quotes, are cleaned up?

I'm open to a variety of suggestions, including Python packages that deal with JSON string cleansing, or any advice on GenAI tools that enforce JSON output. I presently use Gemini, which I like a lot, but it doesn't seem to allow JSON enforcement as explicitly as OpenAI's API does.

Regarding "I presently use Gemini, which I like a lot. But doesn't allow JSON enforcement like OpenAI's API allows more explicitly.": is this report useful? medium.com/google-cloud/…
– Tanaike, Commented Aug 15, 2024 at 5:18
Please edit your question and include a minimal reproducible example.
– Linda Lawton - DaImTo, Commented Aug 15, 2024 at 6:35
Using regex, you could match the value up to the next key or the end of the string. Capture the value and use a function to convert double quotes to single quotes; see this Python demo (regex101).
– bobble bubble, Commented Aug 15, 2024 at 9:29
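
The approach described in the last comment can be sketched roughly as follows. The linked regex101 demo is not reproduced here, so the pattern and the helper name `repair_inner_quotes` are assumptions made for illustration. The sketch also assumes a flat object whose keys are simple words and whose values are all strings.

```python
import json
import re

# Assumed pattern (not taken from the linked demo): a string value starts at
# the quote after "key": and ends at the first quote that is followed by
# either ,"key": or the closing brace at the end of the text.
VALUE_PATTERN = re.compile(
    r'("\w+"\s*:\s*)"(.*?)"(?=\s*(?:,\s*"\w+"\s*:|\}\s*$))'
)

def repair_inner_quotes(text: str) -> str:
    """Replace double quotes inside string values with single quotes."""
    def fix_value(m: re.Match) -> str:
        key_part, body = m.group(1), m.group(2)
        return key_part + '"' + body.replace('"', "'") + '"'
    return VALUE_PATTERN.sub(fix_value, text)

# The failing input from the question, with a comma inside the value.
broken = '{"test":"this is "test" of "a", test"","result":"your result is "out" in our website"}'
fixed = repair_inner_quotes(broken)
print(fixed)
print(json.loads(fixed))
```

Because the end of a value is detected by looking for the next key rather than for any comma, a comma inside the value no longer cuts it short. A value that itself contains text shaped like `","word":` would still be split in the wrong place, so this is best treated as a fallback and not as a general JSON repair tool.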

Linda Lawton - DaImTo · Accepted Answer · 2024-08-15 06:44:49Z

If you are requesting JSON back, you should set response_mime_type; then you will not have these issues with parsing the JSON.

from dotenv import load_dotenv
import google.generativeai as genai
import os

load_dotenv()
genai.configure(api_key=os.environ['API_KEY'])
MODEL_NAME_LATEST = os.environ['MODEL_NAME_LATEST']

model = genai.GenerativeModel(
    model_name=MODEL_NAME_LATEST,
    # Set the `response_mime_type` to output JSON
    generation_config={"response_mime_type": "application/json"})

prompt = """
  List 5 popular cookie recipes.
  Using this JSON schema:
    Recipe = {"recipe_name": str}
  Return a `list[Recipe]`
  """

response = model.generate_content(prompt)
print(response.text)

Just remember to make sure that the JSON schema you tell it to use is itself correct, with all commas where they should be, or the model may build its output incorrectly.

response schema

Another option would be to use a response schema.

from dotenv import load_dotenv
import google.generativeai as genai
import os
import typing_extensions as typing

load_dotenv()
genai.configure(api_key=os.environ['API_KEY'])
MODEL_NAME_LATEST = os.environ['MODEL_NAME_LATEST']


class Recipe(typing.TypedDict):
    recipe_name: str


model = genai.GenerativeModel(
    model_name=MODEL_NAME_LATEST,
    # Set the `response_mime_type` to output JSON
    # Pass the schema object to the `response_schema` field
    generation_config={"response_mime_type": "application/json",
                       "response_schema": list[Recipe]})

prompt = "List 5 popular cookie recipes"

response = model.generate_content(prompt)
print(response.text)
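
Once JSON mode is enabled, `response.text` can be handed to `json.loads` directly. The sketch below shows one way to parse and sanity-check the result. The helper name `parse_recipes` and the sample string are hypothetical stand-ins, not real model output, and no API call is made.

```python
import json

def parse_recipes(raw_text: str) -> list[dict]:
    """Parse JSON text and check that it is a list of {"recipe_name": str}."""
    data = json.loads(raw_text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of recipes")
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("recipe_name"), str):
            raise ValueError(f"unexpected recipe entry: {item!r}")
    return data

# Hypothetical stand-in for response.text.
sample_text = '[{"recipe_name": "Chocolate Chip Cookies"}, {"recipe_name": "Oatmeal Raisin Cookies"}]'
recipes = parse_recipes(sample_text)
print([r["recipe_name"] for r in recipes])
```

Keeping the shape check separate from the parsing step makes it easier to decide whether to retry the prompt or report the error.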

See the JSON mode documentation.
