I'm trying to build a chain which will chunk a long PDF document (currently loaded in as markdown). I have the following Pydantic classes created.

from langchain.pydantic_v1 import BaseModel, Field
from typing import List

class HeaderSection(BaseModel):
    """Class to save a section header and text from the section"""

    header: str = Field(description="Header of a section from the document.")
    text: str = Field(description="Text under the associated header.")


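class AllSections(BaseModel):
    """Class to save every section extracted from the document"""

    sections: List[HeaderSection] = Field(description="List of all header sections found in the document.")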

I then have this code chunk setting the structured output.

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

# llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
structured_json_output = llm.with_structured_output(AllSections)

system = """You are tasked with splitting up a document into sections. These sections all have headers, which is how you will determine where to split all of the data.
Return the header in the "header" field of the HeaderSection class. The text that comes below the header, return the text as part of the "text" field of the HeaderSection class. \
You will be given the input text, and the headers.
Here's an example on how a chunk of data will be stored.

EXAMPLE TEXT:
# Bill of Lading and Driver Signature

Item 6

The signature of a Carrier Freight Driver/Sales Representative on any Bill of Lading other than a Carrier’s Bill of Lading will act only to acknowledge the receipt of freight as described on the document. This signature will not acknowledge agreement to any terms and conditions of carriage and/or liability conditions that may also appear on the document. Unless there is a written agreement, separate from the Bill of Lading, signed by shipper and Carrier, then the Carrier Freight Bill of Lading Terms and Conditions will apply.

EXAMPLE OUTPUT:
HeaderSection(header="Bill of Lading and Driver Signature Item 6", text="The signature of a Carrier Freight Driver/Sales Representative on any Bill of Lading other than a Carrier’s Bill of Lading will act only to acknowledge the receipt of freight as described on the document. This signature will not acknowledge agreement to any terms and conditions of carriage and/or liability conditions that may also appear on the document. Unless there is a written agreement, separate from the Bill of Lading, signed by shipper and Carrier, then the Carrier Freight Bill of Lading Terms and Conditions will apply.")

NOTE: The item number may not always be present.
"""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "Headers: {headers}\n\nText: {text}")
    ]
)

chunking_chain = prompt | structured_json_output
output = chunking_chain.invoke({"headers": llama_parse_headers_and_items, "text": llama_data_md[0].text})

Running this raises the following error:

ValidationError: 1 validation error for AllSections
sections
  field required (type=value_error.missing)

When I set the HeaderSection class as the structured output class, it works and returns one section. However, I need it to return all of the sections, which is why I'm trying to use the AllSections class, which is supposed to hold a list of HeaderSection objects. Any idea what this error means in this context, and how I can get this to run?

  • You should modify the prompt to return a list of HeaderSection. Your example output should be an AllSections whose sections field is a list of HeaderSection objects. Commented Jul 9, 2024 at 9:15
  • @InsertCheesyLine I see what you mean by fixing the example output, but what do you mean by "return a list of HeaderSection"? Should .with_structured_output() take list[HeaderSection]? Commented Jul 9, 2024 at 18:52

1 Answer

You are passing AllSections to with_structured_output, so the example output in your prompt needs to follow that schema: an AllSections object whose sections field is a list of HeaderSection objects.

For example:

EXAMPLE OUTPUT:

AllSections(
  sections=[
    HeaderSection(header="Bill of Lading ...", text="The signature..."),
    HeaderSection(header="Bill of Lading ...", text="The signature..."),
    ...
  ]
)
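
If you would rather show the example as JSON (I'm assuming here that the structured output goes through tool calling, so the model fills in JSON arguments), keep in mind that ChatPromptTemplate's default f-string format treats single braces as template variables, so literal braces have to be doubled. Here is a minimal sketch using only the standard library. The escape_braces helper and the shortened example values are placeholders of mine, not part of LangChain:

```python
import json

# Example payload shaped like AllSections: one top-level "sections" key
# holding a list of header/text objects. The values are the shortened
# placeholders from the example above, not real document text.
example_output = {
    "sections": [
        {"header": "Bill of Lading ...", "text": "The signature..."},
        {"header": "Bill of Lading ...", "text": "The signature..."},
    ]
}


def escape_braces(text: str) -> str:
    """Double every brace so an f-string style template keeps it literal."""
    return text.replace("{", "{{").replace("}", "}}")


# Text that could replace the EXAMPLE OUTPUT part of the system prompt.
example_block = "EXAMPLE OUTPUT:\n" + escape_braces(
    json.dumps(example_output, indent=2)
)

print(example_block)
```

You could then paste example_block into the system string in place of the current EXAMPLE OUTPUT. I have not verified that this alone removes the ValidationError. If it persists, it may also be worth checking whether the response is being cut off by the output token limit, since the full text of every section has to fit in a single reply.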