Error when using a custom Pydantic class for structured output with Langchain

Question

I'm trying to build a chain that chunks a long PDF document (currently loaded as markdown). I have created the following Pydantic classes:

from langchain.pydantic_v1 import BaseModel, Field
from typing import List

class HeaderSection(BaseModel):
    """Class to save a section header and text from the section"""

    header: str = Field(description="Header of a section from the document.")
    text: str = Field(description="Text under the associated header.")


class AllSections(BaseModel):
    """Class to hold every section extracted from the document"""

    sections: List[HeaderSection]

I then set up the structured output with this code:

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate  # needed for ChatPromptTemplate below

# llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
structured_json_output = llm.with_structured_output(AllSections)

system = """You are tasked with splitting up a document into sections. These sections all have headers, which is how you will determine where to split all of the data.
Return the header in the "header" field of the HeaderSection class. The text that comes below the header, return the text as part of the "text" field of the HeaderSection class. \
You will be given the input text, and the headers.
Here's an example on how a chunk of data will be stored.

EXAMPLE TEXT:
# Bill of Lading and Driver Signature

Item 6

The signature of a Carrier Freight Driver/Sales Representative on any Bill of Lading other than a Carrier’s Bill of Lading will act only to acknowledge the receipt of freight as described on the document. This signature will not acknowledge agreement to any terms and conditions of carriage and/or liability conditions that may also appear on the document. Unless there is a written agreement, separate from the Bill of Lading, signed by shipper and Carrier, then the Carrier Freight Bill of Lading Terms and Conditions will apply.

EXAMPLE OUTPUT:
HeaderSection(header="Bill of Lading and Driver Signature Item 6", text="The signature of a Carrier Freight Driver/Sales Representative on any Bill of Lading other than a Carrier’s Bill of Lading will act only to acknowledge the receipt of freight as described on the document. This signature will not acknowledge agreement to any terms and conditions of carriage and/or liability conditions that may also appear on the document. Unless there is a written agreement, separate from the Bill of Lading, signed by shipper and Carrier, then the Carrier Freight Bill of Lading Terms and Conditions will apply.")

NOTE: The item number may not always be present.
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Headers: {headers}\n\nText: {text}")
    ]
)

chunking_chain = prompt | structured_json_output
output = chunking_chain.invoke({"headers": llama_parse_headers_and_items, "text": llama_data_md[0].text})

When I run it, I receive this error:

ValidationError: 1 validation error for AllSections
sections
  field required (type=value_error.missing)

When I set the HeaderSection class as the structured output class, it works and returns one section. However, I need it to return all of the sections, which is why I'm trying to use the AllSections class, which is supposed to hold a list of HeaderSection. Any idea what this error means in this context, and how I can get this to run?

You should modify the prompt to return a list of HeaderSection. Your example output should be an AllSections whose sections field is a list of HeaderSection objects.
– InsertCheesyLine, Commented Jul 9, 2024 at 9:15
@InsertCheesyLine I see what you mean by fixing the example output, but what do you mean by returning a list of HeaderSection? Should .with_structured_output() take list[HeaderSection]?
– Alex McGraw, Commented Jul 9, 2024 at 18:52

InsertCheesyLine · Accepted Answer · 2024-07-10 04:24:47Z

You are passing AllSections to with_structured_output, so the example output in your prompt needs to follow that schema: an AllSections whose sections field is a list of HeaderSection objects.

For example:

EXAMPLE OUTPUT:

AllSections(
  sections=[
    HeaderSection(header="Bill of Lading ...", text="The signature..."),
    HeaderSection(header="Bill of Lading ...", text="The signature..."),
    ...
  ]
)
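
To see why the shape matters, here is a minimal sketch that reproduces the validation error without calling the model. It assumes the model's structured output arrived shaped like the original EXAMPLE OUTPUT (a single bare section), which is a guess based on the error message and not something confirmed in the question. Both payload dictionaries are hypothetical, and the sketch imports plain pydantic instead of langchain.pydantic_v1 so it runs on its own.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class HeaderSection(BaseModel):
    """Class to save a section header and text from the section"""

    header: str = Field(description="Header of a section from the document.")
    text: str = Field(description="Text under the associated header.")


class AllSections(BaseModel):
    """Class to hold every section extracted from the document"""

    sections: List[HeaderSection]


# Hypothetical payload shaped like the original EXAMPLE OUTPUT: one bare section.
single_section_payload = {"header": "Bill of Lading ...", "text": "The signature..."}

# Hypothetical payload shaped like the schema: a "sections" list of sections.
wrapped_payload = {"sections": [single_section_payload]}

try:
    AllSections(**single_section_payload)
    missing_field_error = None
except ValidationError as exc:
    # "sections" is required but absent, so validation fails on that field.
    missing_field_error = exc

# The wrapped payload matches the schema and validates cleanly.
parsed = AllSections(**wrapped_payload)

print(missing_field_error)
print(len(parsed.sections), parsed.sections[0].header)
```

The first payload fails because AllSections has no header or text fields of its own, only a required sections list. Once the prompt's example shows that outer list, the model has a pattern to follow that matches the schema.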

