1

I'm a beginner programmer, so this is probably a trivial question. I have an .html file with a deeply nested unordered list. How can I copy, for example, only the first 4 nesting levels into a new, empty .html file in Python? Do I need BeautifulSoup for this? To illustrate, here is JavaScript (using jQuery) that produces the same effect for display:

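// Remove every `selector` element nested more than `level` levels below `root`.
function nestless(root, selector, level) {
    var use = root;
    // Build a descendant selector with level + 1 repetitions, e.g. '#root ul ul ul ul ul'.
    for (var i = 0; i <= level; i++) {
        use += ' ' + selector;
    }
    $(use).remove(); // jQuery: delete all matching elements
}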

Here I would use:

nestless('#root', 'ul', 4);

It seems that my original question was badly written and difficult to parse; I'm sorry for that. The .html files are not really websites, but rather text documents written by hand in an HTML editor and saved as .html. They contain nothing that couldn't be written with a LaTeX editor.

For example, if I wanted to reduce this list to the first 2 levels:

  • A
  • B
    • C
    • D
      • E
      • F
  • G

to

  • A
  • B
    • C
    • D
  • G

From my own research, there are HTML parsers that support CSS selectors, such as BeautifulSoup + soupselect, PyQuery, or lxml, but I'm not sure which is the easiest way to proceed or where to start reading.

4
  • Sorry, I can't quite follow your question. BeautifulSoup does the parsing of XML/HTML markup. Commented Jul 20, 2012 at 15:55
  • (1) can we see some of the page structure, especially how the lists are nested? Do non-leaf nodes contain anything in addition to the sub-list? (2) what is it you want back - a nested list of limited depth, or a flat list? Commented Jul 20, 2012 at 16:09
  • The lists are standard <ul> nested lists, in the form of <ul> <li>A</li> <li>B</li> <ul> <li>C</li> <li>D<br> </li> </ul> </ul> Commented Jul 20, 2012 at 16:33
  • ... shouldn't the second <ul></ul> be inside a <li></li>? Commented Jul 20, 2012 at 18:05

2 Answers

1

I would look at Mechanize (http://wwwsearch.sourceforge.net/mechanize/) to parse the HTML and get to the list itself. Try not to use regular expressions for this, as they quickly become messy and just make things more difficult.

0

You don't need BeautifulSoup, but doing it without it would be a pain.

Use it to:

  • find your first level list tag;
  • iterate on the first level;
  • for each element, iterate to the second level;
  • do the same for the third and fourth levels;
  • at the fourth level, iterate over the items and delete any nested list they contain.

Keep the resulting object in memory, and insert it as a child of the new HTML document when you generate the new .html file.
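
As a rough sketch of the steps above (not the answerer's own code), the pruning could look like this with BeautifulSoup 4. It assumes the `beautifulsoup4` package is installed; `prune_lists`, `copy_pruned` and the file paths are hypothetical names made up for illustration. Instead of walking level by level, it counts the `<ul>` ancestors of each list and removes the lists that sit just past the allowed depth, which takes everything deeper with them.

```python
from bs4 import BeautifulSoup


def prune_lists(html, max_depth):
    """Return `html` with every <ul> nested deeper than `max_depth` removed."""
    # "html.parser" is the parser bundled with Python, so no extra parser is needed.
    soup = BeautifulSoup(html, "html.parser")

    # A <ul> with exactly `max_depth` <ul> ancestors sits at level max_depth + 1.
    # Collect these first, then remove them; anything deeper goes with them.
    too_deep = [
        ul for ul in soup.find_all("ul")
        if len(ul.find_parents("ul")) == max_depth
    ]
    for ul in too_deep:
        ul.decompose()

    return str(soup)


def copy_pruned(src_path, dst_path, max_depth):
    """Read src_path, prune its lists, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        pruned = prune_lists(src.read(), max_depth)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(pruned)
```

For the A to G example in the question, `prune_lists(html, 2)` should drop the list holding E and F, and something like `copy_pruned("in.html", "out.html", 4)` would keep the first 4 levels. Because only `<ul>` ancestors are counted, this should also cope with a `<ul>` placed directly inside another `<ul>`, as in the markup quoted in the comments, at least with `html.parser`, which keeps the nesting as written.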
