<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>R: Download a URI</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="stylesheet" type="text/css" href="R.css">
</head><body>

<table width="100%" summary="page for getURL"><tr><td>getURL</td><td align="right">R Documentation</td></tr></table>

<h2>Download a URI</h2>

<h3>Description</h3>


<p>These functions download one or more URIs (a.k.a. URLs).
They use libcurl under the hood to perform the request
and retrieve the response.
A myriad of options can be specified via
the <code>...</code> mechanism to control the creation and submission
of the request and the processing of the response.
</p>
<p><code>getURLContent</code> has been added as a high-level function
like <code>getURL</code> and <code>getBinaryURL</code>, but one which
determines the type of the content being downloaded
by examining the Content-Type field of the resulting
HTTP header. It uses this to determine whether the bytes
are binary or &quot;text&quot;.
</p>
<p>The request supports any of the facilities within the
version of libcurl that was installed.
One can examine these via <code>curlVersion</code>.
</p>
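<p>For instance, one can query the installed libcurl's capabilities
along these lines (a minimal sketch; the exact entries returned depend
on how libcurl was built):
</p>
<pre>
  library(RCurl)
  cv = curlVersion()
  cv$version                      # the libcurl version string
  cv$protocols                    # protocols compiled in, e.g. "http", "ftp"
  "ssl" %in% names(cv$features)   # TRUE if SSL/TLS support is available
</pre>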
<p><code>getURLContent</code> doesn't perform asynchronous or multiple
concurrent requests at present.
</p>


<h3>Usage</h3>

<pre>
getURL(url, ..., .opts = list(),
       write = basicTextGatherer(.mapUnicode = .mapUnicode),
       curl = getCurlHandle(), async = length(url) &gt; 1,
       .encoding = integer(), .mapUnicode = TRUE)
getURI(url, ..., .opts = list(),
       write = basicTextGatherer(.mapUnicode = .mapUnicode),
       curl = getCurlHandle(), async = length(url) &gt; 1,
       .encoding = integer(), .mapUnicode = TRUE)
getURLContent(url, ..., curl = getCurlHandle(.opts = .opts), .encoding = NA,
              binary = NA, .opts = list(...),
              header = dynCurlReader(curl, binary = binary,
                                     baseURL = url, isHTTP = isHTTP,
                                     encoding = .encoding),
              isHTTP = length(grep('^[[:space:]]*http', url)) &gt; 0)
</pre>


<h3>Arguments</h3>


<table summary="R argblock">
<tr valign="top"><td><code>url</code></td>
<td>
<p>a string giving the URI, or a character vector of URIs when
downloading several documents in one call</p>
</td></tr>
<tr valign="top"><td><code>...</code></td>
<td>
<p>named values that are interpreted as CURL options
governing the HTTP request.</p>
</td></tr>
<tr valign="top"><td><code>.opts</code></td>
<td>
<p>a named list or <code>CURLOptions</code> object identifying the
curl options for the handle. This is merged with the values of <code>...</code>
to create the actual options for the curl handle in the request
(see the sketch following this table).</p>
</td></tr>  
<tr valign="top"><td><code>write</code></td>
<td>
<p>if explicitly supplied, this is a function that is called
with a single argument each time the HTTP response handler has
gathered sufficient text. The argument to the function is
a single string.  The default value provides
a function that accumulates this text, and that object is then used
to retrieve the text as the return value of this function.
</p>
</td></tr>
<tr valign="top"><td><code>curl</code></td>
<td>
<p>the previously initialized CURL context/handle which can
be used for multiple requests.</p>
</td></tr>
<tr valign="top"><td><code>async</code></td>
<td>
<p>a logical value that determines whether the download
request should be done via asynchronous, concurrent downloading or a serial
download. This really only arises when we are trying to download
multiple URIs in a single call. There are trade-offs between
concurrent and serial downloads, essentially trading CPU cycles
for shorter elapsed times. Concurrent downloads reduce the overall
time waiting for <code>getURI</code>/<code>getURL</code> to return.
</p>
</td></tr>
<tr valign="top"><td><code>.encoding</code></td>
<td>
<p>an integer or a string that explicitly identifies the
encoding of the content that is returned by the HTTP server in its
response to our query. The possible strings are
&lsquo;UTF-8&rsquo; or &lsquo;ISO-8859-1&rsquo;,
and the integers should be specified symbolically
as <code>CE_UTF8</code> and <code>CE_LATIN1</code>.
Note that, by default, the package attempts to process the header of
the HTTP response to determine the encoding. This argument is used
when such information is erroneous and the caller knows the correct
encoding.
The default value leaves the decision to this default mechanism.
That mechanism does, however, currently involve processing each line/chunk
of the header (with a call to an R function). As a result,
if one knows the encoding of the response in advance,
specifying it here avoids that slight overhead, which is probably
quite small relative to network latency and speed.
</p>
</td></tr>
<tr valign="top"><td><code>.mapUnicode</code></td>
<td>
<p>a logical value that controls whether the resulting
text is processed to map components of the form \uxxxx to their
appropriate Unicode representation.</p>
</td></tr>
<tr valign="top"><td><code>binary</code></td>
<td>
<p>a logical value indicating whether the content is known by the
caller to be binary (<code>TRUE</code>), known to be text
(<code>FALSE</code>), or unknown (<code>NA</code>).
</p>
</td></tr>
<tr valign="top"><td><code>header</code></td>
<td>
<p>this is made available as a parameter of the function
to allow callers to construct different readers for processing
the header and body of the (HTTP) response.
Callers specifying this will typically only adjust the
call to <code>dynCurlReader</code>, e.g. to specify a
function for its <code>value</code> parameter to
control how the body is post-processed.
</p>
</td></tr>
<tr valign="top"><td><code>isHTTP</code></td>
<td>
<p>a logical value that indicates whether the request is an
HTTP request. This is used when determining how to process the response.</p>
</td></tr>
</table>
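<p>As a small illustration of how <code>...</code> and <code>.opts</code>
combine (a sketch, not an exhaustive treatment): options given directly
in the call and options collected in a <code>curlOptions</code> object
are merged into a single set for the request.
</p>
<pre>
  opts = curlOptions(followlocation = TRUE, maxredirs = 10L)
     # verbose comes from ..., followlocation/maxredirs from .opts
  txt = getURL("http://www.omegahat.net", verbose = TRUE, .opts = opts)
</pre>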


<h3>Value</h3>


<p>If no value is supplied for <code>write</code>,
the result is the text that is the HTTP response.
(HTTP header information is included if the header option for CURL is
set to <code>TRUE</code> and no handler for headerfunction is supplied in
the CURL options.)
</p>
<p>Alternatively, if a value is supplied for the <code>write</code> parameter,
that value is returned. This allows the caller to create a handler within
the call and get it back, avoiding having to explicitly create
and assign it, then call <code>getURL</code>, and then access the result.
Instead, the three steps can be inlined in a single call.
</p>
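<p>A minimal sketch of this inlining, using <code>basicTextGatherer</code>
as the <code>write</code> handler:
</p>
<pre>
  h = getURL("http://www.omegahat.net", write = basicTextGatherer())
     # the gatherer itself is returned, so we read the text from it
  txt = h$value()
</pre>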


<h3>Author(s)</h3>

<p>Duncan Temple Lang &lt;duncan@wald.ucdavis.edu&gt;</p>


<h3>References</h3>

<p>Curl homepage <a href="http://curl.haxx.se">http://curl.haxx.se</a></p>


<h3>See Also</h3>


<p><code>getBinaryURL</code>
<code>curlPerform</code>
<code>curlOptions</code>
</p>


<h3>Examples</h3>

<pre>

  omegahatExists = url.exists("http://www.omegahat.net")

   # Regular HTTP
  if(omegahatExists) {
     txt = getURL("http://www.omegahat.net/RCurl/")
        # Then we could parse the result.
     if(require(XML))
         htmlTreeParse(txt, asText = TRUE)
  }


        # HTTPS. First check to see that we have support compiled into
        # libcurl for ssl.
  if(interactive() &amp;&amp; ("ssl" %in% names(curlVersion()$features))
         &amp;&amp; url.exists("https://sourceforge.net/")) {
     txt = tryCatch(getURL("https://sourceforge.net/"),
                    error = function(e) {
                                  getURL("https://sourceforge.net/",
                                            ssl.verifypeer = FALSE)
                              })

  }


     # Create a CURL handle that we will reuse.
  if(interactive() &amp;&amp; omegahatExists) {
     curl = getCurlHandle()
     pages = list()
     for(u in c("http://www.omegahat.net/RCurl/index.html",
                "http://www.omegahat.net/RGtk/index.html")) {
         pages[[u]] = getURL(u, curl = curl)
     }
  }


    # Set additional fields in the header of the HTTP request.
    # verbose option allows us to see that they were included.
  if(omegahatExists)
     getURL("http://www.omegahat.net", httpheader = c(Accept = "text/html", 
                                                      MyField = "Duncan"), 
               verbose = TRUE)



    # Arrange to read the header of the response from the HTTP server as
    # a separate "stream". Then we can break it into name-value
    # pairs. (The first line is the status line, e.g. HTTP/1.1 200 OK
    # or 301 Moved Permanently.)
  if(omegahatExists) {
     h = basicTextGatherer()
     txt = getURL("http://www.omegahat.net/RCurl/index.html", header= TRUE, headerfunction = h$update, 
                   httpheader = c(Accept="text/html", Test=1), verbose = TRUE) 
     print(paste(h$value(NULL)[-1], collapse=""))
     read.dcf(textConnection(paste(h$value(NULL)[-1], collapse="")))
  }



   # Test the passwords.
  if(omegahatExists) {
     x = getURL("http://www.omegahat.net/RCurl/testPassword/index.html",  userpwd = "bob:duncantl")

       # Catch an error because no authorization credentials are supplied.
       # We catch the generic HTTPError, but we could catch the more specific
       # "Unauthorized" error type.
      x = tryCatch(getURLContent("http://www.omegahat.net/RCurl/testPassword/index.html"),
                    HTTPError = function(e) {
                                   cat("HTTP error: ", e$message, "\n")
                                })
  }

## Not run: 
  #  Needs specific information from the cookie file on a per user basis
  #  with a registration to the NY times.
  x = getURL("http://www.nytimes.com",
                 header = TRUE, verbose = TRUE,
                 cookiefile = "/home/duncan/Rcookies",
                 netrc = TRUE,
                 maxredirs = as.integer(20),
                 netrc.file = "/home2/duncan/.netrc1",
                 followlocation = TRUE)

## End(Not run)

   if(interactive() &amp;&amp; omegahatExists) {
       d = debugGatherer()
       x = getURL("http://www.omegahat.net", debugfunction = d$update, verbose = TRUE)
       d$value()
   }

    #############################################
    #  Using an option set in R

   if(interactive() &amp;&amp; omegahatExists) {
      opts = curlOptions(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
      getURL("http://www.omegahat.net/RCurl/testPassword/index.html", verbose = TRUE, .opts = opts)

         # Using options in the CURL handle.
      h = getCurlHandle(header = TRUE, userpwd = "bob:duncantl", netrc = TRUE)
      getURL("http://www.omegahat.net/RCurl/testPassword/index.html", verbose = TRUE, curl = h)
   }



   # Use a C routine as the reader. Currently gives a warning.
  if(interactive() &amp;&amp; omegahatExists) {
     routine = getNativeSymbolInfo("R_internalWriteTest", PACKAGE = "RCurl")$address
     getURL("http://www.omegahat.net/RCurl/index.html", writefunction = routine)
  }



  # Download multiple URIs with getURI(), concurrently by default.
  if(interactive() &amp;&amp; omegahatExists) {
     uris = c("http://www.omegahat.net/RCurl/index.html", "http://www.omegahat.net/RCurl/philosophy.xml")
     txt = getURI(uris)
     names(txt)
     nchar(txt)

     txt = getURI(uris, async = FALSE)
     names(txt)
     nchar(txt)


     routine = getNativeSymbolInfo("R_internalWriteTest", PACKAGE = "RCurl")$address
     txt = getURI(uris, write = routine, async = FALSE)
     names(txt)
     nchar(txt)


         # getURLContent() for text and binary
     x = getURLContent("http://www.omegahat.net/RCurl/index.html")
     class(x)

     x = getURLContent("http://www.omegahat.net/RCurl/data.gz")
     class(x)
     attr(x, "Content-Type")

     x = getURLContent("http://www.omegahat.net/Rcartogram/demo.jpg")
     class(x)
     attr(x, "Content-Type")


     curl = getCurlHandle()
     dd = getURLContent("http://www.omegahat.net/RJSONIO/RJSONIO.pdf",
                        curl = curl,
                        header = dynCurlReader(curl, binary = TRUE,
                                           value = function(x) {
                                                    print(attributes(x)) 
                                                    x}))
   }



  # FTP
  # Download the files within a directory.
if(interactive() &amp;&amp; url.exists('ftp://ftp.wcc.nrcs.usda.gov')) {

   url = 'ftp://ftp.wcc.nrcs.usda.gov/data/snow/snow_course/table/history/idaho/'
   filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE)

      # Deal with newlines as \n or \r\n. (BDR)
      # Or alternatively, instruct libcurl to change \n's to \r\n's for us with crlf = TRUE
      # filenames = getURL(url, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)
   filenames = paste(url, strsplit(filenames, "\r*\n")[[1]], sep = "")
   con = getCurlHandle( ftp.use.epsv = FALSE)

      # there is a slight possibility that some of the files that are
      # returned in the directory listing and in filenames will disappear
      # when we go back to get them. So we use a try() in the call getURL.
   contents = sapply(filenames[1:5], function(x) try(getURL(x, curl = con)))
   names(contents) = filenames[1:length(contents)]
}
   
</pre>


</body></html>