2

I know this post and have already read it, but I'd still like to learn what language an HTML parser might use. I mean, does it parse the whole source with a regex, or does it use a general-purpose programming language such as C# or Python?

Apart from the question above, can you also brief me on where I should start if I want to create my own parser? (I'd like to create an HTML parser for my personal needs :)

10
  • 1
    You can use any Turing complete language. Regular expressions (at least those of formal language theory) aren’t. But most regular expression libraries and implementations are far more capable (see for example Can extended regex implementations parse HTML?). Commented Jul 29, 2011 at 18:12
  • 2
    Be sure to read this: stackoverflow.com/questions/1732348/… A masterpiece of StackOverflow. Commented Jul 29, 2011 at 18:12
  • I've already read that :) @Iterator Commented Jul 29, 2011 at 18:21
  • 1
    @Gumbo: technically, you can use a pushdown automaton. You don't need Turing completeness :D. And to Shaokan: the fact that HTML has a context-free grammar makes any traditional programming language quite suitable. There are a variety of tools for building such parsers. I like ANTLR with Java (or C# or Python). If you want to build such a parser completely by hand, you should consult any reference on compiler implementation. Parsing CFGs is almost always well discussed in compiler books. Commented Jul 29, 2011 at 20:39
  • 1
    Does this answer your question? Writing an HTML Parser Commented Nov 11, 2020 at 22:01

2 Answers

2

Python, Java, and Perl are all fine languages for learning to write an HTML parser. Perl is very pleasant for regular expressions, but regular expressions are not what you need for a parser. It is a bit more pleasant to write object-oriented programs in Python or Java. C, C++, C#, and similar languages are also common when the parser has to be very fast. However, as a learning exercise, I recommend Python or Java, so that you can compare your work with the standard parsers.
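To make the "compare your work with standard parsers" idea concrete, here is a minimal sketch in Python. It records the events reported by the standard library's `html.parser.HTMLParser`, so you could diff them against whatever your own parser produces. The names `EventRecorder` and `reference_events` are made up for this illustration, and dropping whitespace-only text is just one possible choice.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Collects the parse events emitted by the standard library parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text so the event list stays readable
        text = data.strip()
        if text:
            self.events.append(("text", text))


def reference_events(html):
    """Return the event list a hand-written parser could be compared against."""
    recorder = EventRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.events


if __name__ == "__main__":
    for event in reference_events('<p class="x">Hi <b>there</b></p>'):
        print(event)
```

If your own parser can emit the same kind of event tuples, a plain `==` comparison of the two lists is probably enough for a first round of tests.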

1

The standard way is to use a Yacc/Lex pair: Lex generates code that splits the input into tokens, and Yacc generates code that converts the token stream into the desired structure.
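The same two-stage split can be sketched by hand, without Lex or Yacc. The Python below is only an illustration under heavy simplifying assumptions: tags must be properly nested, attributes are skipped rather than parsed, and comments, void elements such as `<br>`, and real-world error recovery are ignored. `TOKEN_RE`, `Node`, `tokenize`, and `parse` are names invented for this sketch.

```python
import re
from dataclasses import dataclass, field

# Lexer stage: an opening or closing tag (attributes skipped), or a run of text.
TOKEN_RE = re.compile(
    r"<(?P<close>/)?(?P<tag>[A-Za-z][A-Za-z0-9]*)[^>]*>|(?P<text>[^<]+)"
)


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def tokenize(source):
    """Split the source into ("open", tag), ("close", tag) and ("text", s) tokens."""
    for match in TOKEN_RE.finditer(source):
        if match.group("text") is not None:
            text = match.group("text").strip()
            if text:
                yield ("text", text)
        elif match.group("close"):
            yield ("close", match.group("tag").lower())
        else:
            yield ("open", match.group("tag").lower())


def parse(tokens):
    """Parser stage: turn the token stream into a tree, using a stack of open nodes."""
    root = Node("#root")
    stack = [root]
    for kind, value in tokens:
        if kind == "open":
            node = Node(value)
            stack[-1].children.append(node)
            stack.append(node)
        elif kind == "close":
            if len(stack) > 1 and stack[-1].tag == value:
                stack.pop()
            else:
                raise ValueError(f"unexpected closing tag </{value}>")
        else:
            stack[-1].children.append(value)
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1].tag}>")
    return root


if __name__ == "__main__":
    print(parse(tokenize("<ul><li>one</li><li>two</li></ul>")))
```

Note that the regex only does the tokenizing here. The nesting is tracked by the explicit stack in `parse`, which is the part a plain regular expression cannot do.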

There is also a more tempting option, Ragel. Here you write one big regexp-like structure capable of matching the entire file and define hooks that fire whenever a certain sub-pattern is matched.
