Programmatically reading a web page

Question

I want to write a program in C/C++ that dynamically reads a web page and extracts information from it. As an example, imagine an application that follows and logs an eBay auction. Is there an easy way to grab the web page, such as a library that provides this functionality? And is there an easy way to parse the page to get the specific data?

VERY difficult in C/C++. It's annoying enough even in languages that have extensive support for regular expressions, XML parsing, HTTP methods, etc. (e.g. Java). As for eBay, it has an API you should use.
– cletus, Commented Dec 23, 2008 at 15:03

philant · Accepted Answer · 2008-12-23 15:47:46Z

44

Have a look at the cURL library:

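 #include <stdio.h>
 #include <curl/curl.h>

 int main(void)
 {
   CURL *curl;
   CURLcode res;

   curl = curl_easy_init();
   if(curl) {
     /* URL to fetch; by default the response body is written to stdout */
     curl_easy_setopt(curl, CURLOPT_URL, "curl.haxx.se");
     res = curl_easy_perform(curl);
     /* report a failed transfer */
     if(res != CURLE_OK)
       fprintf(stderr, "curl_easy_perform() failed: %s\n",
               curl_easy_strerror(res));
     /* always cleanup */
     curl_easy_cleanup(curl);
   }
   return 0;
 }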

BTW, if C++ is not strictly required, I encourage you to try C# or Java. It is much easier, and both have a built-in way to do this.

edited Dec 23, 2008 at 15:47

philant

answered Dec 23, 2008 at 15:05

Gant

7 Comments

BlaM Over a year ago

+1 for cURL - I've used cURL in one of my C++ applications and it works great, even with proxies and all other obstacles you might encounter.

Matthew Flaschen Over a year ago

It would be better to return an error if curl is null (in above example).

Piotr Dobrogost Over a year ago

Check out curlpp - C++ wrapper for cURL library

Mike Housky Over a year ago

Thumbs up for suggesting C# or Java. Python is even easier, particularly if you have the Beautiful Soup package installed to help with the parsing.

Chloe Dev Over a year ago

Why is this +1'd and chosen as the answer? Where's the actual document? What does the code do? Blatant copy and paste.

Software_Designer · Answer · 2012-09-11 16:54:01Z

16

Windows code:

#include <winsock2.h>
#include <windows.h>
#include <cstdlib>   // system
#include <cstring>   // strlen
#include <iostream>
#pragma comment(lib,"ws2_32.lib")
using namespace std;
int main (){
    // Initialise Winsock 2.2
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        system("pause");
        return 1;
    }
    SOCKET Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);
    // Resolve the host name (IPv4 only)
    struct hostent *host;
    host = gethostbyname("www.google.com");
    if (host == NULL) {
        cout << "Could not resolve host";
        system("pause");
        return 1;
    }
    SOCKADDR_IN SockAddr;
    SockAddr.sin_port=htons(80);
    SockAddr.sin_family=AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);
    cout << "Connecting...\n";
    if(connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr)) != 0){
        cout << "Could not connect";
        system("pause");
        return 1;
    }
    cout << "Connected.\n";
    // Minimal HTTP/1.1 request; "Connection: close" asks the server to
    // close the socket once the response has been sent
    const char *request = "GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n";
    send(Socket, request, (int)strlen(request), 0);
    char buffer[10000];
    int nDataLength;
    // Keep receiving until the server closes the connection
    while ((nDataLength = recv(Socket,buffer,10000,0)) > 0){
        int i = 0;
        // Print until the first byte that is neither printable ASCII nor CR/LF
        while (buffer[i] >= 32 || buffer[i] == '\n' || buffer[i] == '\r') {
            cout << buffer[i];
            i += 1;
        }
    }
    closesocket(Socket);
    WSACleanup();
    system("pause");
    return 0;
}

answered Sep 11, 2012 at 16:54

Software_Designer

3 Comments

Kev Over a year ago

Be careful when posting copy and paste boilerplate/verbatim answers to multiple questions, these tend to be flagged as "spammy" by the community. If you're doing this then it usually means the questions are duplicates so flag them as such instead: stackoverflow.com/a/12374407/419

Imbue Over a year ago

This code has serious flaws: 1) If the page is more than 10,000 bytes without non-printable characters, it will read past the end of buffer and seg-fault. 2) If the webpage has a TAB character in it (or other non-printable characters), this code will skip forward up to 10,000 bytes. 3) New code shouldn't use gethostbyname(). It should use getaddrinfo() and support IPv4 and IPv6.

Imbue Over a year ago

The inner while loop can be replaced by printf("%.*s", nDataLength, buffer); which is easier, faster, and safer.
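
The fixes suggested in these comments come down to using the byte count returned by recv() instead of scanning the buffer for a terminating character. Below is a portable sketch of that idea. `readAll` is a hypothetical helper, and `recvFn` stands in for a call such as recv(Socket, buf, len, 0), so the loop can be tried without a network connection.

```cpp
#include <cstddef>
#include <string>

// Collects a whole response by calling recvFn until it reports end of stream.
// recvFn(buf, len) must behave like recv(): it returns the number of bytes
// written to buf, 0 when the peer has closed the connection, or a negative
// value on error.
template <typename RecvFn>
bool readAll(RecvFn recvFn, std::string& out)
{
    char buffer[4096];
    for (;;) {
        int n = recvFn(buffer, static_cast<int>(sizeof buffer));
        if (n < 0)
            return false;   // receive error
        if (n == 0)
            return true;    // connection closed: response complete
        // Append exactly n bytes; TABs and other control bytes are kept as-is.
        out.append(buffer, static_cast<std::size_t>(n));
    }
}
```

With Winsock this could be called as `readAll([&](char* b, int len) { return recv(Socket, b, len, 0); }, page);` where `page` is a std::string. Because only `n` bytes are ever read from the buffer, the page size and its contents no longer affect memory safety.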

Rob · Answer · 2008-12-23 18:13:55Z

4

There is a free TCP/IP library available for Windows that supports HTTP and HTTPS - using it is very straightforward.

Ultimate TCP/IP

CUT_HTTPClient http;
http.GET("http://folder/file.htm", "c:/tmp/process_me.htm");

You can also GET files and store them in a memory buffer (via CUT_DataSource derived classes). All the usual HTTP support is there - PUT, HEAD, etc. Support for proxy servers is a breeze, as are secure sockets.

answered Dec 23, 2008 at 18:13

Rob

Alexander Smirnov · Answer · 2016-09-23 19:26:43Z

3

Try using a library, like Qt, which can read data from across a network and get data out of an XML document. This is an example of how to read an XML feed. You could use the eBay feed, for example.

edited Sep 23, 2016 at 19:26

Alexander Smirnov

answered Dec 23, 2008 at 15:10

Marius

Diomidis Spinellis · Answer · 2008-12-23 15:06:39Z

2

You can do it with socket programming, but it's tricky to implement the parts of the protocol needed to reliably fetch a page. Better to use a library, like neon. This is likely to be installed in most Linux distributions. Under FreeBSD use the fetch library.
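
To give a sense of what implementing "the parts of the protocol" involves, here is a rough sketch of the first step after receiving a raw HTTP/1.x response: reading the status code and separating the headers from the body. `HttpResponse` and `splitResponse` are names made up for this illustration. Redirects, chunked transfer encoding and compression are not handled, and those are the parts a library such as neon takes care of.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

struct HttpResponse {
    int status = 0;        // e.g. 200 or 404
    std::string headers;   // header lines after the status line, without the final blank line
    std::string body;      // everything after the blank line
};

// Splits a raw HTTP/1.x response. Returns false if the text is malformed.
bool splitResponse(const std::string& raw, HttpResponse& resp)
{
    // The header section ends at the first empty line (CRLF CRLF).
    const std::size_t headerEnd = raw.find("\r\n\r\n");
    if (headerEnd == std::string::npos)
        return false;

    // The status line looks like: HTTP/1.1 200 OK
    const std::size_t lineEnd = raw.find("\r\n");
    const std::string statusLine = raw.substr(0, lineEnd);
    if (statusLine.compare(0, 5, "HTTP/") != 0)
        return false;
    const std::size_t space = statusLine.find(' ');
    if (space == std::string::npos)
        return false;
    resp.status = std::atoi(statusLine.c_str() + space + 1);

    if (lineEnd < headerEnd)
        resp.headers = raw.substr(lineEnd + 2, headerEnd - lineEnd - 2);
    else
        resp.headers.clear();   // status line only, no header fields
    resp.body = raw.substr(headerEnd + 4);
    return resp.status >= 100 && resp.status <= 599;
}
```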

For parsing the data, because many pages don't use valid XML, you need to implement heuristics, not a real yacc-based parser. You can implement these using regular expressions or a state transition machine. As what you're trying to do involves a lot of trial-and-error you're better off using a scripting language, like Perl. Due to the high network latency you will not see any difference in performance.
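
As an illustration of the heuristic approach, a regular expression can pull a single field out of the markup without parsing the whole document. The sketch below uses std::regex (C++11) to extract the text of the first title element. `extractTitle` is a hypothetical helper, and a pattern like this will break on unusual markup, which is where the trial and error comes in.

```cpp
#include <regex>
#include <string>

// Returns the text of the first <title> element, or an empty string if none is found.
std::string extractTitle(const std::string& html)
{
    // [\s\S] matches any character including newlines; *? keeps the match
    // non-greedy so it stops at the first closing tag. icase accepts <TITLE> too.
    static const std::regex titleRe(R"(<title[^>]*>([\s\S]*?)</title>)",
                                    std::regex::icase);
    std::smatch match;
    if (std::regex_search(html, match, titleRe))
        return match[1].str();
    return std::string();
}
```

The same pattern of one small function per field keeps each heuristic easy to adjust when the page layout changes.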

answered Dec 23, 2008 at 15:06

Diomidis Spinellis

2 Comments

Daniel Papasian Over a year ago

While they aren't valid XML, many languages have libraries that have HTML parsers, which will let you use a DOM interface to parse an HTML document.

bortzmeyer Over a year ago

Yes, neon is nice too (but most of my experience is with curl, as mentioned in m3rLinEz's answer). Any comparison somewhere?

Johann Gerell · Answer · 2008-12-30 16:58:57Z

2

You don't mention a platform, so I'll give you an answer for Win32.

One simple way to download anything from the Internet is the URLDownloadToFile function with the IBindStatusCallback parameter set to NULL. To make the function more useful (for example, to report download progress), the callback interface needs to be implemented.

answered Dec 30, 2008 at 16:58

Johann Gerell

baziorek · Answer · 2021-04-24 17:12:04Z

2

It can be done with the cross-platform Qt library:

QByteArray WebpageDownloader::downloadFromUrl(const std::string& url)
{
    QNetworkAccessManager manager;
    // Start an asynchronous GET request
    QNetworkReply *response = manager.get(QNetworkRequest(QUrl(url.c_str())));
    // Run a local event loop until the reply signals that it has finished
    QEventLoop event;
    QObject::connect(response, &QNetworkReply::finished, &event, &QEventLoop::quit);
    event.exec();
    // Note: response->error() is not checked here
    return response->readAll();
}

That data can then be saved to a file, for example, or converted to a std::string:

const std::string webpageText = downloadFromUrl(url).toStdString();

Remember that you need to add

QT       += network

to the Qt project file (.pro) to compile the code.

answered Apr 24, 2021 at 17:12

baziorek
