File: tutorial-01-wget.md

# Creating your first task: wget

# Sample code

[tutorial-01-wget.cc](/tutorial/tutorial-01-wget.cc)

# About wget

wget reads HTTP/HTTPS URLs from stdin, fetches the pages, and prints their content to stdout. It also writes the HTTP headers of the request and the response to stderr.   
For convenience, wget exits on Ctrl-C, but only after making sure that all resources have been released.

# Creating and starting an HTTP task

~~~cpp
WFHttpTask *task = WFTaskFactory::create_http_task(url, REDIRECT_MAX, RETRY_MAX, wget_callback);
protocol::HttpRequest *req = task->get_req();
req->add_header_pair("Accept", "*/*");
req->add_header_pair("User-Agent", "Wget/1.14 (gnu-linux)");
req->add_header_pair("Connection", "close");
task->start();
pause();
~~~

**WFTaskFactory::create\_http\_task()** generates an HTTP task. In [WFTaskFactory.h](/src/factory/WFTaskFactory.h), the prototype is defined as follows:

~~~cpp
WFHttpTask *create_http_task(const std::string& url,
                             int redirect_max, int retry_max,
                             http_callback_t callback);
~~~

The first three parameters are the URL, the maximum number of redirects, and the maximum number of retries. **http\_callback\_t** is the callback type of an HTTP task, which is defined below:

~~~cpp
using http_callback_t = std::function<void (WFHttpTask *)>;
~~~

Put simply, it is a function that takes the **Task** as its only parameter and returns nothing. You can pass NULL as the callback to indicate that there is none. Callbacks in all tasks follow the same rule.   
Note that factory functions never fail, so even if the URL is invalid, the returned task is not a null pointer. All errors are handled in the callback.   
You can use **task->get\_req()** to get the request of the task. The default method is GET over HTTP/1.1 on a persistent (keep-alive) connection. Before sending the request, the framework automatically fills in request\_uri, Host, and any other header fields that are required, such as Content-Length or Connection. You can also call **add\_header\_pair()** to add your own headers. For more interfaces on HTTP messages, please see [HttpMessage.h](/src/protocol/HttpMessage.h).   
**task->start()** starts the task. It is non-blocking and never fails; the callback of the task is called once the task finishes. Because the task is asynchronous, you must not use the task pointer after **start()**, except inside the callback.   
To keep the example as simple as possible, it calls **pause()** after **start()** to prevent the program from exiting. Press Ctrl-C to exit the program.
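
The behavior described above (a callback held in a **std::function** that may be empty, a non-blocking start, and a main thread that has to stay alive until the callback has run) can be sketched without the framework. In the sketch below, `MockTask`, `mock_callback_t`, `start_mock_task` and `run_and_wait` are hypothetical names introduced only for illustration; they are not Workflow APIs. A plain `std::thread` stands in for the framework's asynchronous machinery, and a condition variable takes the place of **pause()** so that the caller resumes as soon as the callback has finished.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for a task; not part of Workflow.
struct MockTask
{
    std::string url;
    int state;   // 0 means success, 1 means failure (in this sketch only)
};

using mock_callback_t = std::function<void (MockTask *)>;

// Runs the "task" on another thread and returns immediately, similar in
// spirit to a non-blocking start(). An empty callback is simply skipped.
std::thread start_mock_task(MockTask *task, mock_callback_t callback)
{
    return std::thread([task, callback]() {
        // Errors are reported through the task state, not by the starter.
        task->state = task->url.empty() ? 1 : 0;
        if (callback)
            callback(task);
    });
}

// Starts one task and blocks until its callback has run.
// Returns the state that the callback observed.
int run_and_wait(const std::string& url)
{
    std::mutex mutex;
    std::condition_variable cond;
    bool finished = false;
    int observed_state = -1;

    MockTask task;
    task.url = url;
    task.state = -1;

    std::thread worker = start_mock_task(&task, [&](MockTask *t) {
        std::lock_guard<std::mutex> lock(mutex);
        observed_state = t->state;
        finished = true;
        cond.notify_one();
    });

    {
        // Wait for the callback instead of calling pause().
        std::unique_lock<std::mutex> lock(mutex);
        cond.wait(lock, [&] { return finished; });
    }
    worker.join();
    return observed_state;
}
```

Here the lambda plays the role of the callback, and `run_and_wait()` returns only after it has executed. Passing `nullptr` to `start_mock_task()` is also valid in this sketch, mirroring the rule that a task may have no callback.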

# Handling crawled HTTP results

This example demonstrates how to handle the results with an ordinary function. **std::function** also accepts other callables, such as lambdas and function objects.

~~~cpp
void wget_callback(WFHttpTask *task)
{
    protocol::HttpRequest *req = task->get_req();
    protocol::HttpResponse *resp = task->get_resp();
    int state = task->get_state();
    int error = task->get_error();

    // handle error states
    ...

    std::string name;
    std::string value;
    // print request to stderr
    // request line: method, request URI, HTTP version
    fprintf(stderr, "%s %s %s\r\n", req->get_method(), req->get_request_uri(), req->get_http_version());
    protocol::HttpHeaderCursor req_cursor(req);
    while (req_cursor.next(name, value))
        fprintf(stderr, "%s: %s\r\n", name.c_str(), value.c_str());
    fprintf(stderr, "\r\n");
    
    // print response header to stderr
    ...

    // print response body to stdout
    const void *body;
    size_t body_len;
    resp->get_parsed_body(&body, &body_len); // always succeeds for a successful task
    fwrite(body, 1, body_len, stdout);
    fflush(stdout);
}
~~~

The task passed to this callback is the one created by the factory.   
Use **task->get\_state()** and **task->get\_error()** to obtain the running state and the error code of the task, respectively. Error handling is skipped for now.  
Use **task->get\_resp()** to get the response of the task. It differs only slightly from the request, because both are derived from HttpMessage.   
Then use HttpHeaderCursor to iterate over the headers of the request and the response. [HttpUtil.h](/src/protocol/HttpUtil.h) contains the definition of the cursor.

~~~cpp
class HttpHeaderCursor
{
public:
    HttpHeaderCursor(const HttpMessage *message);
    ...
    void rewind();
    ...
    bool next(std::string& name, std::string& value);
    bool find(const std::string& name, std::string& value);
    ...
};
~~~
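
The interplay of **next()**, **find()** and **rewind()** in the interface above can be illustrated with a small stand-in. `MockHeaderCursor` and `count_remaining` below are hypothetical and are not Workflow's implementation: the sketch assumes that **find()** searches forward from the current position and leaves the cursor just after the match, and it compares header names exactly (case-sensitively) to stay short.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in with the same three member functions as the
// cursor shown above. It is not Workflow's implementation.
class MockHeaderCursor
{
public:
    typedef std::vector<std::pair<std::string, std::string> > HeaderList;

    explicit MockHeaderCursor(const HeaderList& headers)
        : headers_(headers), pos_(0) { }

    // Moves the cursor back to the position before the first header.
    void rewind() { pos_ = 0; }

    // Copies the next header into name/value; returns false at the end.
    bool next(std::string& name, std::string& value)
    {
        if (pos_ >= headers_.size())
            return false;
        name = headers_[pos_].first;
        value = headers_[pos_].second;
        ++pos_;
        return true;
    }

    // Assumption: searches forward from the current position and leaves
    // the cursor after the match (or at the end if nothing matches).
    bool find(const std::string& name, std::string& value)
    {
        std::string cur_name, cur_value;
        while (next(cur_name, cur_value))
        {
            if (cur_name == name)
            {
                value = cur_value;
                return true;
            }
        }
        return false;
    }

private:
    HeaderList headers_;
    size_t pos_;
};

// Counts how many headers next() still yields from the current position.
size_t count_remaining(MockHeaderCursor& cursor)
{
    std::string name, value;
    size_t n = 0;
    while (cursor.next(name, value))
        ++n;
    return n;
}
```

With three headers in the list, calling `find()` for the second one leaves a single header for `next()` to return; after `rewind()`, iteration yields all three again.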

The cursor is used as in the callback above: construct it from a message and call **next()** until it returns false.   
The line **resp->get\_parsed\_body()** obtains the HTTP body of the response. When the task is successful, this call always returns true and body points to the data area.   
The call returns the raw HTTP body and does not decode chunked data. To decode the chunks, use the HttpChunkCursor in [HttpUtil.h](/src/protocol/HttpUtil.h).   
In addition, **find()** changes the position inside the cursor. To iterate over the headers after calling **find()**, call **rewind()** first to return the cursor to the beginning.
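
To make concrete what decoding a chunked body involves, here is a minimal, hand-written sketch of a decoder for HTTP/1.1 chunked transfer coding. It is not HttpChunkCursor and does not use any Workflow API; `decode_chunked` is a hypothetical helper that works on a complete raw body held in a `std::string`, ignores chunk extensions and trailers, and does not guard against absurdly large size fields beyond the bounds check.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes a complete body in chunked transfer coding.
// Returns false if the input is malformed or truncated.
bool decode_chunked(const std::string& raw, std::string& decoded)
{
    decoded.clear();
    size_t pos = 0;

    while (true)
    {
        // Each chunk starts with a size line terminated by CRLF.
        size_t line_end = raw.find("\r\n", pos);
        if (line_end == std::string::npos)
            return false;

        // Parse the hexadecimal chunk size.
        size_t size = 0;
        size_t i = pos;
        for (; i < line_end && std::isxdigit((unsigned char)raw[i]); ++i)
        {
            unsigned char c = (unsigned char)raw[i];
            int digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
            size = size * 16 + (size_t)digit;
        }
        if (i == pos)
            return false;                      // no hex digits at all
        if (i < line_end && raw[i] != ';')
            return false;                      // only ";extension" may follow

        pos = line_end + 2;
        if (size == 0)
            return true;                       // last chunk; trailers ignored

        // The chunk data must be followed by CRLF.
        size_t remaining = raw.size() - pos;
        if (size > remaining || remaining - size < 2)
            return false;
        if (raw.compare(pos + size, 2, "\r\n") != 0)
            return false;

        decoded.append(raw, pos, size);
        pos += size + 2;
    }
}
```

For example, the raw body `"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"` decodes to `Wikipedia`, while a body that ends in the middle of a chunk is rejected.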