RateLimitError: Error code: 429 while running a RAG application consisting of the gpt-4o API, a Pinecone vector store, and AzureAIDocumentIntelligenceLoader

Question

Hi, I am currently trying to run a RAG application (an FAQ chatbot) that consists of two UIs: one where we upload files and store their embeddings in a Pinecone vector store, and another where we retrieve the embeddings from the selected index into the RAG chatbot. I use gpt-4o on a paid account (tier 1, 30000 tokens per minute) as my primary LLM, and AzureAIDocumentIntelligenceLoader to load my PDF files asynchronously (using the aload() function). In this case I load a 272-page PDF and chat with it.

Even when I just type in 'hi', it says:

"'message': 'Request too large for gpt-4o in organization org-wOFxlX2RaRVsbRdbSuZ5iBGM on tokens per min (TPM): Limit 30000, Requested 49634. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'"

I was able to chat successfully when the PDF was loaded with 'PyPDFium2Loader'.

The first doubt is how it requested 50000 tokens when I have only typed 'hi' to the chatbot. The second doubt is: even though I made the PDF loader function async and added a time delay before retrieving the responses, why am I still getting the error code 429?

async def extract_embeddings_upload_index(pdf_path, index_name):
    print(f"Loading PDF from path: {pdf_path}")
    
    # Load PDF documents
    async def lol(pdf_path):
        loader = AzureAIDocumentIntelligenceLoader(
            api_key="167f20e5ce49431aad891c46e2268696",
            file_path=pdf_path,
            api_endpoint="https://rx11.cognitiveservices.azure.com/",
            api_model="prebuilt-layout",
            mode="single",
        )
        return await loader.aload()

    txt_docs = await lol(pdf_path)
    #total_pages=txt_docs
    #print(f'{total_pages}')
    #txt_docs = PyPDFium2Loader(pdf_path).load()
    
    # Split documents
    print("Splitting documents...")
    splt_docs = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=1000)
    docs = splt_docs.split_documents(txt_docs)
    print(f"Split into {len(docs)} chunks")

    # Initialize OpenAI embeddings
    print("Initializing OpenAI embeddings...")
    embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

    # Upload documents to Pinecone index
    print("Initializing Pinecone Vector Store...")
    dbx = PineconeVectorStore.from_documents(documents=docs, index_name=index_name, embedding=embeddings)
    print(f"Uploaded {len(docs)} documents to Pinecone index '{index_name}'")

def initialize(index_name):
    embeddings = ini_embed()
    print('11')
    dbx = PineconeVectorStore.from_existing_index(index_name=index_name, embedding=embeddings)
    print('12')
    llm = ChatOpenAI(model='gpt-4o', temperature=0.5, max_tokens=3000)
    
   # model_id="meta-llama/Meta-Llama-3-8B"
   #model=AutoModelForCausalLM.from_pretrained(model_id)
    #tokenizer=AutoTokenizer.from_pretrained(model)
    #pipe=pipeline("text-generation",model=model,tokenizer=tokenizer,max_new_tokens=5000)
    repo_id="meta-llama/Llama-2-7b-hf"

    print('13')
    prompt = ini_prompt()
    print('14')
    doc_chain = create_stuff_documents_chain(llm, prompt)
    print('15')
    retriever = dbx.as_retriever()
    print('16')
    ans_retrieval = create_retrieval_chain(retriever, doc_chain)
    print('17')

    # Wrap the retrieval chain with RunnableWithMessageHistory
    conversational_ans_retrieval = RunnableWithMessageHistory(
        ans_retrieval,
        lambda session_id: StreamlitChatMessageHistory(key=session_id),
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer"
    )
    print('17')
    
    print(session_id)
    print('18')
    

    return conversational_ans_retrieval

def run_query(retrieval_chain, input_text):
    st.write('run query')
    try:
        # Generate a response using the retrieval chain
        time.sleep(60)
        response = retrieval_chain.invoke(
            {"input": input_text},
            config={"configurable": {"session_id": f'{session_id}'}}
        )
        
        return response['answer']
    except KeyError as e:
        st.error(f"KeyError occurred: {e}. Check the response structure.")
        return None

Suresh Chikkam · Accepted Answer · 2024-07-17 10:56:26Z

The first doubt is how it requested 50000 tokens when I have only typed 'hi' to the chatbot.

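This happens because the chain retrieves chunks from the index and sends them to the model together with your message, so the token count of the request is much larger than the message itself. With large chunks, the PDF content placed in the context quickly adds up on every request, even for a one-word input.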
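As a rough illustration of how the context dominates the request size, here is a minimal sketch of a token estimate. The 4-characters-per-token ratio is only a rule of thumb, k=4 is assumed to be the number of chunks the retriever returns, and counting the reserved max_tokens toward the limit is my understanding of how the limit is applied, so treat the numbers as approximate. The real count depends on the tokenizer, the markup the loader produces, and the chat history, so this will not reproduce the 49634 from the error exactly.

```python
# Rough token budget for one RAG request. All constants are assumptions.
CHARS_PER_TOKEN = 4  # rule-of-thumb average, not an exact tokenizer count


def estimate_request_tokens(chunk_chars, k, question_chars, max_tokens, history_chars=0):
    """Estimate tokens for one request: retrieved context, question, history, reserved output."""
    context_tokens = (chunk_chars * k) // CHARS_PER_TOKEN
    question_tokens = question_chars // CHARS_PER_TOKEN + 1
    history_tokens = history_chars // CHARS_PER_TOKEN
    return context_tokens + question_tokens + history_tokens + max_tokens


# chunk_size=10000 (question code) versus chunk_size=1000, both with max_tokens=3000
large = estimate_request_tokens(chunk_chars=10000, k=4, question_chars=2, max_tokens=3000)
small = estimate_request_tokens(chunk_chars=1000, k=4, question_chars=2, max_tokens=3000)
print(large, small)
```

Under these assumptions the retrieved context, not the typed message, accounts for almost all of the difference between the two estimates.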

Why am I still getting the error code 429?

Please check this link (Comment by @AshokPeddakotla-MSFT)
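The error text in the question says the input or output tokens must be reduced, which suggests that waiting alone will not help when a single request asks for more tokens than the per-minute limit. A minimal sketch of that check is below. The helper names are hypothetical, the regular expression is based only on the message format shown in the question, and the sample message is shortened.

```python
import re


def parse_tpm_error(message):
    """Extract (limit, requested) from a tokens-per-minute error message, or None."""
    match = re.search(r"Limit (\d+), Requested (\d+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_worth_retrying(message):
    """Return False when one request alone exceeds the per-minute limit."""
    parsed = parse_tpm_error(message)
    if parsed is None:
        return True  # unknown format: fall back to retrying
    limit, requested = parsed
    return requested <= limit


msg = "Request too large for gpt-4o on tokens per min (TPM): Limit 30000, Requested 49634."
print(parse_tpm_error(msg), is_worth_retrying(msg))
```

When the check returns False, the request itself has to shrink (smaller chunks, fewer retrieved chunks, or a lower max_tokens) before a retry can succeed.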

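Here I have implemented retry logic with exponential backoff to handle rate limits gracefully. I have also reduced the splitter's chunk_size from 10000 to 1000 (and chunk_overlap from 1000 to 100), so each retrieved chunk adds far fewer tokens to the prompt.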

Code:

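import asyncio
import time

# Import paths assume the split LangChain packages (langchain-openai,
# langchain-pinecone, langchain-community); adjust them to your installed versions.
import streamlit as st
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.chat_message_histories import StreamlitChatMessageHistory
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import RateLimitError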

async def extract_embeddings_upload_index(pdf_path, index_name):
    print(f"Loading PDF from path: {pdf_path}")

    async def load_pdf(pdf_path):
        loader = AzureAIDocumentIntelligenceLoader(
            api_key="167f20e5ce49431aad891c46e2268696",
            file_path=pdf_path,
            api_endpoint="https://rx11.cognitiveservices.azure.com/",
            api_model="prebuilt-layout",
            mode="single"
        )
        return await loader.aload()

    # Retry logic with exponential backoff
    max_retries = 5
    retry_delay = 1  # Initial delay in seconds

    for attempt in range(max_retries):
        try:
            txt_docs = await load_pdf(pdf_path)
            break
        except RateLimitError as e:
            print(f"Rate limit exceeded. Retrying in {retry_delay} seconds...")
            await asyncio.sleep(retry_delay)  # non-blocking wait inside a coroutine
            retry_delay *= 2  # Exponential backoff
    else:
        print("Failed to load PDF after multiple attempts.")
        return

    # Split documents
    print("Splitting documents...")
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    docs = splitter.split_documents(txt_docs)
    print(f"Split into {len(docs)} chunks")

    # Initialize OpenAI embeddings
    print("Initializing OpenAI embeddings...")
    embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')

    # Upload documents to Pinecone index
    print("Initializing Pinecone Vector Store...")
    dbx = PineconeVectorStore.from_documents(documents=docs, index_name=index_name, embedding=embeddings)
    print(f"Uploaded {len(docs)} documents to Pinecone index '{index_name}'")

def initialize(index_name):
    embeddings = OpenAIEmbeddings(model='text-embedding-ada-002')
    dbx = PineconeVectorStore.from_existing_index(index_name=index_name, embedding=embeddings)
    llm = ChatOpenAI(model='gpt-4o', temperature=0.5, max_tokens=3000)

    prompt = ini_prompt()
    doc_chain = create_stuff_documents_chain(llm, prompt)
    retriever = dbx.as_retriever()
    ans_retrieval = create_retrieval_chain(retriever, doc_chain)

    # Wrap the retrieval chain with RunnableWithMessageHistory
    conversational_ans_retrieval = RunnableWithMessageHistory(
        ans_retrieval,
        lambda session_id: StreamlitChatMessageHistory(key=session_id),
        input_messages_key="input",
        history_messages_key="chat_history",
        output_messages_key="answer"
    )

    return conversational_ans_retrieval

def run_query(retrieval_chain, input_text):
    st.write('run query')
    try:
        # Retry logic with exponential backoff
        max_retries = 5
        retry_delay = 1  # Initial delay in seconds

        for attempt in range(max_retries):
            try:
                # Generate a response using the retrieval chain
                # (the fixed 60-second sleep is dropped; the backoff below handles waiting)
                response = retrieval_chain.invoke(
                    {"input": input_text},
                    config={"configurable": {"session_id": f'{session_id}'}}
                )
                return response['answer']
            except RateLimitError as e:
                print(f"Rate limit exceeded. Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
        else:
            st.error("Failed to retrieve response after multiple attempts.")
            return None
    except KeyError as e:
        st.error(f"KeyError occurred: {e}. Check the response structure.")
        return None

If possible, split the work into smaller requests so that each one stays within the token limit while still getting the necessary data.
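One way to keep each request small is to cap the retrieved context at a token budget before it reaches the model. The sketch below is only an illustration: fit_chunks_to_budget is a hypothetical helper, and the 4-characters-per-token ratio is a rule of thumb rather than a real tokenizer count.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average, not an exact tokenizer count


def fit_chunks_to_budget(chunks, budget_tokens):
    """Keep chunks in ranked order until the estimated token budget is used up."""
    kept, used = [], 0
    for text in chunks:
        cost = len(text) // CHARS_PER_TOKEN + 1
        if used + cost > budget_tokens:
            break  # the next chunk would exceed the budget
        kept.append(text)
        used += cost
    return kept


chunks = ["a" * 4000, "b" * 4000, "c" * 4000]
kept = fit_chunks_to_budget(chunks, budget_tokens=2500)
print(len(kept))
```

In LangChain, a simpler way to get a similar effect is to ask the retriever for fewer chunks, for example dbx.as_retriever(search_kwargs={"k": 2}).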

Reference:

Rate limit reached for gpt-4 in organization org-XXXX on tokens per min

Thanks for the help, the changes you made to my code are working.
