I want to write a program in C/C++ that will dynamically read a web page and extract information from it. As an example, imagine you wanted to write an application to follow and log an eBay auction. Is there an easy way to grab the web page? A library that provides this functionality? And is there an easy way to parse the page to get the specific data?
7 Answers
Have a look at the cURL library:
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    CURL *curl;
    CURLcode res;

    curl = curl_easy_init();
    if(curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "curl.haxx.se");
        res = curl_easy_perform(curl);
        /* always cleanup */
        curl_easy_cleanup(curl);
    }
    return 0;
}
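By default libcurl writes the response body to stdout. If you want the page in memory so you can parse it, you can install a write callback. A minimal sketch; the callback and buffer names are just illustrative:
#include <string>
#include <curl/curl.h>

/* libcurl calls this once per chunk of the response body. */
static size_t write_cb(char* data, size_t size, size_t nmemb, void* userdata)
{
    std::string* body = static_cast<std::string*>(userdata);
    body->append(data, size * nmemb);
    return size * nmemb;
}

int main(void)
{
    std::string body;
    CURL* curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "curl.haxx.se");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
        if (curl_easy_perform(curl) == CURLE_OK) {
            /* body now holds the HTML; search or parse it here */
        }
        curl_easy_cleanup(curl);
    }
    return 0;
}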
By the way, if C++ is not strictly required, I encourage you to try C# or Java. They make this much easier and have built-in support for fetching web pages.
7 Comments
Windows code:
#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <cstring>
#pragma comment(lib, "ws2_32.lib")
using namespace std;

int main()
{
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2, 2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        return 1;
    }

    SOCKET Socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    // Resolve the host name to an IPv4 address.
    struct hostent* host = gethostbyname("www.google.com");
    if (host == NULL) {
        cout << "Could not resolve host.\n";
        WSACleanup();
        return 1;
    }

    SOCKADDR_IN SockAddr;
    SockAddr.sin_port = htons(80);
    SockAddr.sin_family = AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

    cout << "Connecting...\n";
    if (connect(Socket, (SOCKADDR*)(&SockAddr), sizeof(SockAddr)) != 0) {
        cout << "Could not connect.\n";
        WSACleanup();
        return 1;
    }
    cout << "Connected.\n";

    // Send a minimal HTTP/1.1 request and ask the server to close the connection.
    const char* request =
        "GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n";
    send(Socket, request, (int)strlen(request), 0);

    // Read the response and echo exactly the bytes received.
    char buffer[10000];
    int nDataLength;
    while ((nDataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0) {
        cout.write(buffer, nDataLength);
    }

    closesocket(Socket);
    WSACleanup();
    return 0;
}
3 Comments
The code shouldn't use gethostbyname(); it should use getaddrinfo() so it supports both IPv4 and IPv6. Printing each received chunk with printf("%.*s", nDataLength, buffer); is also easier, faster, and safer.
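A rough sketch of what that comment suggests, resolving the host with getaddrinfo() instead of gethostbyname() (the helper name is illustrative, error handling is trimmed, and WSAStartup() must already have been called):
#include <winsock2.h>
#include <ws2tcpip.h>
#pragma comment(lib, "ws2_32.lib")

// Connect to host:port over TCP, trying every address getaddrinfo() returns.
SOCKET connect_to(const char* host, const char* port)
{
    addrinfo hints = {};
    hints.ai_family   = AF_UNSPEC;      // allow either IPv4 or IPv6
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_protocol = IPPROTO_TCP;

    addrinfo* results = nullptr;
    if (getaddrinfo(host, port, &hints, &results) != 0)
        return INVALID_SOCKET;

    SOCKET s = INVALID_SOCKET;
    for (addrinfo* p = results; p != nullptr; p = p->ai_next) {
        s = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (s == INVALID_SOCKET)
            continue;
        if (connect(s, p->ai_addr, (int)p->ai_addrlen) == 0)
            break;                      // connected
        closesocket(s);
        s = INVALID_SOCKET;
    }
    freeaddrinfo(results);
    return s;                           // INVALID_SOCKET on failure
}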
There is a free TCP/IP library available for Windows that supports HTTP and HTTPS - using it is very straightforward.
CUT_HTTPClient http;
http.GET("http://folder/file.htm", "c:/tmp/process_me.htm");
You can also GET files and store them in a memory buffer (via CUT_DataSource derived classes). All the usual HTTP support is there - PUT, HEAD, etc. Support for proxy servers is a breeze, as are secure sockets.
Comments
Try using a library like Qt, which can read data over a network and pull data out of an XML document. This is an example of how to read an XML feed; you could use the eBay feed, for example.
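As a rough sketch of the parsing side with Qt (assuming the feed has already been downloaded into a QByteArray, e.g. with the Qt networking code in another answer; the element name "title" is only a placeholder):
#include <QXmlStreamReader>
#include <QByteArray>
#include <QDebug>

// Walk the XML stream and print the text of every <title> element.
void printTitles(const QByteArray& xml)
{
    QXmlStreamReader reader(xml);
    while (!reader.atEnd()) {
        reader.readNext();
        if (reader.isStartElement() && reader.name() == QLatin1String("title"))
            qDebug() << reader.readElementText();
    }
    if (reader.hasError())
        qWarning() << "XML parse error:" << reader.errorString();
}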
Comments
You can do it with socket programming, but it's tricky to implement the parts of the protocol needed to reliably fetch a page. Better to use a library such as neon, which is likely to be packaged in most Linux distributions; under FreeBSD, use the fetch library.
For parsing the data, because many pages don't use valid XML, you need heuristics rather than a real yacc-based parser. You can implement these with regular expressions or a state machine. Since what you're trying to do involves a lot of trial and error, you're better off using a scripting language like Perl; given the high network latency, you won't see any difference in performance.
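If you stay in C++, a heuristic extraction can be as simple as a regular-expression search. A sketch with std::regex; the HTML string and the pattern are only illustrative:
#include <regex>
#include <string>
#include <iostream>

int main()
{
    // In practice this string would come from the HTTP fetch.
    std::string html = "<html><head><title>Auction: vintage camera</title></head></html>";

    // Heuristic extraction: grab whatever sits between <title> and </title>.
    std::smatch m;
    std::regex title_re("<title>(.*?)</title>", std::regex::icase);
    if (std::regex_search(html, m, title_re))
        std::cout << m[1] << "\n";
    return 0;
}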
2 Comments
It can also be done with the cross-platform Qt library:
QByteArray WebpageDownloader::downloadFromUrl(const std::string& url)
{
    QNetworkAccessManager manager;
    QNetworkReply* response = manager.get(QNetworkRequest(QUrl(url.c_str())));

    // Block on a local event loop until the request has finished.
    QEventLoop event;
    QObject::connect(response, &QNetworkReply::finished, &event, &QEventLoop::quit);
    event.exec();

    const QByteArray data = response->readAll();
    response->deleteLater();
    return data;
}
That data can then be saved to a file or converted to a std::string:
const string webpageText = downloadFromUrl(url).toStdString();
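Saving it to a file is similarly short; a sketch using QFile (the helper name and file path are arbitrary):
#include <QFile>
#include <QByteArray>

// Write the downloaded bytes to disk; returns false if the file could not be opened.
bool saveToFile(const QByteArray& data, const QString& path)
{
    QFile out(path);
    if (!out.open(QIODevice::WriteOnly))
        return false;
    out.write(data);
    return true;
}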
Remember that you need to add
QT += network
to the Qt project (.pro) file for the code to compile.