
Lately I have been trying to scrape some data from a web page using C#. My problem is that when I use a WebBrowser object and navigate to the page, the body contains only:

<body>
    <script language="javascript"   src="com.astron.kapar.WebClient/com.astron.kapar.WebClient.nocache.js"></script>
</body>

But if you open the actual page, https://kapalk1.mavir.hu/kapar/lt-publication.jsp?locale=en_GB, and look at the source, you see some tables in the body, probably because the browser loads the scripts.

My question is: how can I work with this kind of page in C#, for example to choose some dates and read the resulting data? Is there a good library for it?

Sorry for bad English.

  • I believe a possible explanation is that the website filters on the user agent and returns different content depending on whether you are using a browser. I don't have the WebBrowser API at hand, but could you try spoofing the User-Agent header to see what it returns? Commented Jul 13, 2015 at 11:58
  • Update: no, that is simply how the page works. I opened it in Firefox, viewed the source with CTRL+U, and found the very same body. The JavaScript generates the HTML on load and is also minified (which means partially obfuscated). You may want to reverse engineer their APIs and make meaningful requests yourself. Commented Jul 13, 2015 at 12:01

2 Answers


You need a headless browser that actually executes the page's JavaScript, such as headless IE or headless WebKit.

These questions might also be relevant.

Headless browser for C# (.NET)?

c# headless browser with javascript support for crawler


If you are familiar with JavaScript, one good solution for scraping a JavaScript-driven site is CasperJS.

I find CasperJS really easy to work with for scraping JavaScript-heavy sites.

  1. Write a CasperJS script that scrapes the site with CSS selectors and writes your desired output to stdout as JSON using JSON.stringify.
  2. Invoke casperjs from C# using ProcessStartInfo. Read its stdout and deserialize the JSON into a POCO.
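The C# side of step 2 could look roughly like the sketch below. It is only an illustration under several assumptions: `casperjs` resolves to an executable on the PATH (on Windows you may need the full path to the launcher), the script prints a single JSON array to stdout, and the `Row` class with its `Date` and `Value` properties is a hypothetical POCO that you would replace with whatever shape your script emits. `System.Text.Json` is used here only because it ships with current .NET; any JSON library such as Json.NET would do the same job.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.Json;

// Hypothetical POCO: the property names must match what the CasperJS script prints.
public sealed class Row
{
    public string Date { get; set; } = "";
    public string Value { get; set; } = "";
}

public static class CasperRunner
{
    // Deserializes the JSON array written by the script into POCOs.
    public static List<Row> ParseRows(string json)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<List<Row>>(json, options) ?? new List<Row>();
    }

    // Runs "casperjs <scriptPath>" and returns everything it wrote to stdout.
    public static string RunScript(string scriptPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "casperjs",               // assumption: available on PATH
            Arguments = "\"" + scriptPath + "\"",
            RedirectStandardOutput = true,
            UseShellExecute = false,             // required to redirect stdout
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            if (process == null)
                throw new InvalidOperationException("Could not start casperjs.");

            // Read stdout before waiting, so a full pipe cannot block the child process.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            if (process.ExitCode != 0)
                throw new InvalidOperationException("casperjs exited with code " + process.ExitCode);

            return output;
        }
    }
}
```

With a script named, say, `scrape.js` (a placeholder name), usage would be `var rows = CasperRunner.ParseRows(CasperRunner.RunScript("scrape.js"));`. Keeping the parsing separate from the process call makes it easy to check the JSON mapping without launching CasperJS.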
