Azure ML Vector Index creation in Prompt Flow UI

Question

I'm trying to create an Azure Search vector index in the Azure ML (Prompt flow) portal UI, but I get an error in the component "LLM - Crack and Chunk Data".

The error says: User program failed with BaseRagServiceError: Rag system error

Part of the logs is:

input_data=/mnt/azureml/cr/j/60652b595f69/cap/data-capability/wd/INPUT_input_data
input_glob=**/*
allowed_extensions=.txt,.md,.html,.htm,.py,.pdf,.ppt,.pptx,.doc,.docx,.xls,.xlsx,.csv,.json
chunk_size=1024
chunk_overlap=0
output_chunks=/mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/output_chunks
data_source_url=azureml://locations/XXXXX/workspaces/04XXXX0/data/vector-index-input-1734572551882/versions/1
document_path_replacement_regex=None
max_sample_files=-1
use_rcts=True
output_format=jsonl
custom_loader=None
doc_intel_connection_id=None
output_title_chunk=None
openai_api_version=None
openai_api_type=None
[2024-12-19 01:43:28] INFO     azureml.rag.crack_and_chunk.crack_and_chunk - ActivityStarted, crack_and_chunk (activity.py:108)
[2024-12-19 01:43:28] INFO     azureml.rag.crack_and_chunk - Processing file: What is prompt flow.pdf (crack_and_chunk.py:127)
/azureml-envs/rag-embeddings/lib/python3.9/site-packages/pypdf/_crypt_providers/_cryptography.py:32: CryptographyDeprecationWarning: ARC4 has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.ARC4 and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
  from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - No file_chunks to yield, continuing (chunking.py:237)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Filtered 0 files out of 1 (crack_and_chunk.py:129)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Skipped extensions: {} (crack_and_chunk.py:130)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - [DocumentChunksIterator::filter_extensions] Kept extensions: {
  ".pdf": 1
} (crack_and_chunk.py:133)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
  ".txt": 0.0,
  ".md": 0.0,
  ".html": 0.0,
  ".htm": 0.0,
  ".py": 0.0,
  ".pdf": 1.0,
  ".ppt": 0.0,
  ".pptx": 0.0,
  ".doc": 0.0,
  ".docx": 0.0,
  ".xls": 0.0,
  ".xlsx": 0.0,
  ".csv": 0.0,
  ".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.cracking - [DocumentChunksIterator::crack_documents] Total time to load files: 0.30446887016296387
{
  ".txt": 0.0,
  ".md": 0.0,
  ".html": 0.0,
  ".htm": 0.0,
  ".py": 0.0,
  ".pdf": 1.0,
  ".ppt": 0.0,
  ".pptx": 0.0,
  ".doc": 0.0,
  ".docx": 0.0,
  ".xls": 0.0,
  ".xlsx": 0.0,
  ".csv": 0.0,
  ".json": 0.0
} (cracking.py:381)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO     azureml.rag.azureml.rag.documents.chunking - [DocumentChunksIterator::split_documents] Total time to split 1 documents into 0 chunks: 0.9676399230957031 (chunking.py:247)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - Processed 0 files (crack_and_chunk.py:208)
[2024-12-19 01:43:31] INFO     azureml.rag.crack_and_chunk - No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/* (crack_and_chunk.py:215)
[2024-12-19 01:43:31] ERROR    azureml.rag.crack_and_chunk.crack_and_chunk - ServiceError: intepreted error = Rag system error, original error = No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*. (exceptions.py:124)
[2024-12-19 01:43:36] ERROR    azureml.rag.crack_and_chunk.crack_and_chunk - crack_and_chunk failed with exception: Traceback (most recent call last):
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 229, in main_wrapper
    map_exceptions(main, activity_logger, args, logger, activity_logger)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 126, in map_exceptions
    raise e
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/utils/exceptions.py", line 118, in map_exceptions
    return func(*func_args, **kwargs)
  File "/azureml-envs/rag-embeddings/lib/python3.9/site-packages/azureml/rag/tasks/crack_and_chunk.py", line 220, in main
    raise ValueError(f"No chunked documents found in {args.input_data} with glob {args.input_glob}.")
ValueError: No chunked documents found in /mnt/azureml/cr/j/606547e361134e058c4829792b595f69/cap/data-capability/wd/INPUT_input_data with glob **/*.
 (crack_and_chunk.py:231) ...................................

I tried with both Serverless and a Compute instance, and the result is the same. It seems the chunking step is not doing anything. My file is a single-page PDF without images, to keep things simple.

Does anyone have a suggestion? Thank you in advance!

Please add the configuration you used for this. Also check whether the file is present in the given source.
– Jaya Shankar G S, Commented Dec 20, 2024 at 9:47

Jaya Shankar G S · Accepted Answer · 2024-12-27 09:06:50Z


This kind of error occurs when the document has no content to chunk. Your log is consistent with this: it reports that 1 document was split into 0 chunks.

I got the same error myself.

I had two text files, New Text Document.txt and New Text Document (2).txt. Both were empty, and the job failed with this error.

You said you have a single-page PDF file, so the likely cause is that its content is not being extracted properly.

Try again with 3-4 PDF files that have proper content, and make sure the files are not password protected.
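Before re-running the job, it may help to check locally whether the input files yield any text at all. The following is a minimal sketch, not the component's actual logic: the helper names `extractable_chars` and `find_empty_inputs` are hypothetical, the set of text extensions is an assumption, and PDF handling assumes the third-party pypdf package (the one that appears in the log paths) is installed locally. Encrypted PDFs are simply reported as having no extractable text.

```python
from pathlib import Path

# Assumption: extensions we treat as plain text for this local check.
TEXT_EXTENSIONS = {".txt", ".md", ".html", ".htm", ".py", ".csv", ".json"}
CHECKABLE = TEXT_EXTENSIONS | {".pdf"}


def extractable_chars(path: Path) -> int:
    """Return a rough count of non-whitespace characters readable from the file."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        # Imported lazily so plain-text checks work without pypdf installed.
        from pypdf import PdfReader

        reader = PdfReader(str(path))
        if reader.is_encrypted:
            # Password-protected file: treat as having no extractable text.
            return 0
        # extract_text() can return an empty string for scanned/image-only pages.
        text = "".join((page.extract_text() or "") for page in reader.pages)
    elif suffix in TEXT_EXTENSIONS:
        text = path.read_text(encoding="utf-8", errors="ignore")
    else:
        raise ValueError(f"Unsupported extension for this check: {suffix}")
    # Drop all whitespace so files with only blank lines count as empty.
    return len("".join(text.split()))


def find_empty_inputs(folder) -> list:
    """List names of checkable files under `folder` with no extractable text."""
    root = Path(folder)
    return sorted(
        p.name
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in CHECKABLE
        and extractable_chars(p) == 0
    )
```

Calling `find_empty_inputs("path/to/input_folder")` on a local copy of the data returns the files that are likely to produce zero chunks. A PDF that shows up in that list is probably scanned, image-only, or password protected.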


2 Comments

Ana Armas Dec 28, 2024 at 12:42

JayashankarGS, Thank you for your suggestion!! I will try it again with other pdf and post the result :))

Ana Armas Jan 2 at 15:47

Hello JayashankarGS, your solution works!! I have added another PDF with more pages and the job finished. Thank you so much!
