I want to read a webpage and split it into chunks to feed a vector database in a RAG pipeline. The webpage contains Python code examples, but the splitters ignore that code text, so none of my chunks include it. I tried both the unstructured Python package and the HTMLHeaderTextSplitter class (from the langchain_text_splitters package), with the same result.

The HTML code I want to parse looks like this:

...
<h2 id="examples_1">Examples</h2>
<h3 id="create-camera">Create camera</h3>
<p>Create a camera. Setting invalid_entity_id as the parent entity will make the camera to be created under the Ego entity, as it must be</p>
<pre><code class="language-python">camera_id = workspace.create_entity( anyverse_platform.WorkspaceEntityType.Camera, &quot;New Camera&quot;, anyverse_platform.invalid_entity_id )
</code></pre>
<hr>
<h3 id="add-resource-to-workspace">Add resource to workspace</h3>
...

The script I use to split the webpage into chunks with the "unstructured" package is this:

import json
import os

from unstructured.partition.html import partition_html

# web_path and output_dir_documentation are defined elsewhere in my script
elements = partition_html(url=web_path)
element_dict = [el.to_dict() for el in elements]
output_path = os.path.join(output_dir_documentation, 'unstructured.json')
with open(output_path, 'w', encoding='utf-8') as output_file:
    output_file.write(json.dumps(element_dict, indent=2))

The JSON result corresponding to the piece of HTML above:

  {
    "type": "Title",
    "element_id": "e68ee04dff59551b7d1ae07a2f8a00dc",
    "text": "Examples",
    "metadata": {
      "category_depth": 1,
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "2253b75dcb33b928dae76ea64543f053",
      "url": "https://anyverse.gitlab.io/anyversestudio/",
      "filetype": "text/html"
    }
  },
  {
    "type": "Title",
    "element_id": "534a8b35bbd7e5f0b6006d63efe887a9",
    "text": "Create camera",
    "metadata": {
      "category_depth": 2,
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "e68ee04dff59551b7d1ae07a2f8a00dc",
      "url": "https://anyverse.gitlab.io/anyversestudio/",
      "filetype": "text/html"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "ae9df01594a27733b24d33ca212f2e66",
    "text": "Create a camera. Setting invalid_entity_id as the parent entity will make the camera to be created under the Ego entity, as it must be",
    "metadata": {
      "page_number": 1,
      "languages": [
        "eng"
      ],
      "parent_id": "534a8b35bbd7e5f0b6006d63efe887a9",
      "url": "https://anyverse.gitlab.io/anyversestudio/",
      "filetype": "text/html"
    }
  },
  {
    "type": "Title",
    "element_id": "ec7fe9ae1c7f315580886ee99e826f3c",
    "text": "Add resource to workspace",
    "metadata": {
      "category_depth": 2,
      "page_number": 2,
      "languages": [
        "eng"
      ],
      "parent_id": "e68ee04dff59551b7d1ae07a2f8a00dc",
      "url": "https://anyverse.gitlab.io/anyversestudio/",
      "filetype": "text/html"
    }
  },

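As you can see, the Python code text is missing: there is no element between the "Create camera" paragraph and the "Add resource to workspace" title.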
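To confirm this over the whole output rather than by eye, a small check along these lines could be used. It is only a sketch: summarize_elements is a hypothetical helper name, and it assumes each element is a dict with "type", "element_id" and "text" keys, as in the JSON shown above.

```python
from collections import Counter


def summarize_elements(elements, needle):
    """Count element types and list the ids of elements whose text contains `needle`.

    `elements` is assumed to be a list of dicts shaped like the JSON above.
    """
    types = Counter(el["type"] for el in elements)
    hits = [el["element_id"] for el in elements if needle in el.get("text", "")]
    return types, hits


# Two of the elements from the JSON above, reduced to the keys the helper reads.
sample = [
    {"type": "Title", "element_id": "534a8b35bbd7e5f0b6006d63efe887a9",
     "text": "Create camera"},
    {"type": "NarrativeText", "element_id": "ae9df01594a27733b24d33ca212f2e66",
     "text": "Create a camera. Setting invalid_entity_id as the parent entity "
             "will make the camera to be created under the Ego entity, as it must be"},
]

types, hits = summarize_elements(sample, "create_entity")
print(dict(types))  # element types that were produced
print(hits)         # ids of elements containing the code text, if any
```

An empty hit list for a string that only occurs inside the pre/code block, such as create_entity, means the code text was dropped during partitioning and not merely classified under an unexpected element type.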

I also tried LangChain for this:

import os

from langchain_text_splitters import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=[
        ("h1", "Header1"),
        ("h2", "Header2"),
        ("h3", "Header3"),
    ]
)
chunks = splitter.split_text_from_url(web_path)
# Write each chunk (page_content plus metadata) to its own text file
for index, chunk in enumerate(chunks):
    output_path = os.path.join(output_dir_documentation, f'{index}.txt')
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(str(chunk))

The corresponding chunk for the "Create camera" header is:

page_content='Create a camera. Setting invalid_entity_id as the parent entity will make the camera to be created under the Ego entity, as it must be' metadata={'Header1': 'Scripting', 'Header2': 'Examples', 'Header3': 'Create camera'}

No text from the pre/code HTML tags appears. I checked every chunk generated for the webpage, and none of them contains any text from the pre/code tags.

What am I missing here? How can I tune partition_html and/or HTMLHeaderTextSplitter to get the text under the pre/code HTML tags?

NOTE: I found that BeautifulSoup can extract the missing text from the "pre" tags, but this complicates the chunking too much, because I only need the headers (h1, h2, etc.) as the chunking condition. Chunking the non-code part first, then extracting the code part, and finally merging the two somehow doesn't seem like the way to go. Specialized tools like langchain and unstructured should be able to handle pre and/or code HTML tags.

The BeautifulSoup code:

import requests
from bs4 import BeautifulSoup

data = requests.get(web_path)
soup = BeautifulSoup(data.text, 'html.parser')
# All <pre> elements, which hold the code examples
content = soup.find_all("pre")
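For reference, the behaviour I am after can be sketched with a hand-written splitter. This is a minimal sketch, not how langchain or unstructured work internally: HeaderChunker and split_html_by_headers are hypothetical names, it uses only the standard library's html.parser (the same parser selected in the BeautifulSoup call above), and it assumes that h1/h2/h3 are the only chunk boundaries and that code lives in pre tags.

```python
from html.parser import HTMLParser

HEADER_TAGS = ("h1", "h2", "h3")


class HeaderChunker(HTMLParser):
    """Split HTML into chunks at h1/h2/h3, keeping the text inside <pre> blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &quot; etc. are decoded for us
        self.chunks = []         # finished chunks: {"headers": {...}, "text": "..."}
        self.headers = {}        # currently active header text per tag name
        self._buffer = []        # text pieces of the chunk being built
        self._in_header = None   # header tag we are inside, or None
        self._header_text = []
        self._in_pre = False

    def _flush(self):
        # Close the current chunk, if it has any text, under the active headers.
        if self._buffer:
            self.chunks.append(
                {"headers": dict(self.headers), "text": "\n".join(self._buffer)}
            )
            self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._flush()  # a header always starts a new chunk
            self._in_header = tag
            self._header_text = []
        elif tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == self._in_header:
            # A new header invalidates any deeper headers still active.
            level = HEADER_TAGS.index(tag)
            for deeper in HEADER_TAGS[level + 1:]:
                self.headers.pop(deeper, None)
            self.headers[tag] = "".join(self._header_text).strip()
            self._in_header = None
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_header:
            self._header_text.append(data)
        elif self._in_pre:
            if data.strip():
                self._buffer.append(data.strip("\n"))  # keep code spacing as-is
        else:
            text = " ".join(data.split())  # collapse whitespace in prose
            if text:
                self._buffer.append(text)


def split_html_by_headers(html):
    parser = HeaderChunker()
    parser.feed(html)
    parser.close()   # makes the parser emit any text it is still holding
    parser._flush()  # emit the last chunk
    return parser.chunks


SAMPLE_HTML = """<h2 id="examples_1">Examples</h2>
<h3 id="create-camera">Create camera</h3>
<p>Create a camera.</p>
<pre><code class="language-python">camera_id = workspace.create_entity( &quot;New Camera&quot; )
</code></pre>
<hr>
<h3 id="add-resource-to-workspace">Add resource to workspace</h3>
"""

for chunk in split_html_by_headers(SAMPLE_HTML):
    print(chunk["headers"])
    print(chunk["text"])
```

With the shortened sample above, the "Create camera" chunk holds both the paragraph and the code line, and the headers dict plays the role of the metadata that HTMLHeaderTextSplitter attaches. I would still prefer to get this from partition_html or HTMLHeaderTextSplitter directly.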
  • Hi, checking in in 2025: did you make any headway? Commented Jan 14 at 11:13
  • Nope. Never got an answer from langchain. I opened a ticket in their discord server and apparently they closed the server for good. Commented Jan 29 at 12:02
  • Thanks for responding. I have to write my own splitter Commented Jan 30 at 16:15
