---
title: "BiocFileCache: Managing File Resources Across Sessions"
author: Lori Shepherd
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteIndexEntry{BiocFileCache: Managing File Resources Across Sessions}
    %\VignetteEncoding{UTF-8}
    %\VignetteDepends{rtracklayer}
---

```{r setup, echo=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```

# Overview

Organizing files on a local machine can be cumbersome. This is especially true
for local copies of remote resources that periodically require a new download to
stay up to date. [BiocFileCache][] is designed to help manage local and remote
resource files stored locally. It provides a convenient location to organize
files and, once files are added to the cache, functions to determine whether
remote resources are out of date and require a new download.

## Installation and Loading

`BiocFileCache` is a _Bioconductor_ package and can be installed through
`BiocManager::install()`.

```{r, eval = FALSE}
if (!"BiocManager" %in% rownames(installed.packages()))
     install.packages("BiocManager")
BiocManager::install("BiocFileCache", dependencies=TRUE)
```

After the package is installed, it can be loaded into the _R_ workspace with

```{r, results='hide', warning=FALSE, message=FALSE}
library(BiocFileCache)
```

## Creating / Loading the Cache

The initial step to utilizing [BiocFileCache][] in managing files is to create a
cache object specifying a location. We will create a temporary directory for use
with examples in this vignette. If a path is not specified upon creation, the
default location is a directory `~/.BiocFileCache` in the typical user cache
directory as defined by `rappdirs::user_cache_dir()`.

```{r}
path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)
```

If the path location exists and has been utilized to store files previously, the
previous object will be loaded with any files saved to the cache. If the path
location does not exist, the user will be prompted to create the new directory.
If the session is not interactive (so the user cannot be prompted) or the user
decides not to create the directory, a temporary directory will be used.
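
To illustrate the reloading behaviour described above, here is a minimal
sketch. It assumes a throwaway location created with `tempfile()`; the names
`demo_path`, `demo_bfc`, and `again` are chosen for illustration only.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical throwaway location for this sketch
demo_path <- tempfile()

## first use: the directory does not exist yet, so it is created
demo_bfc <- BiocFileCache(demo_path, ask = FALSE)
saveRDS(1:3, bfcnew(demo_bfc, "demo vector", ext = ".rds"))

## second use of the same path: the existing cache is loaded,
## including the entry added above
again <- BiocFileCache(demo_path, ask = FALSE)
length(again)
bfcinfo(again)$rname
```

Because the cache is identified by its directory, a script can call the
constructor with the same path in every session and continue with the resources
it tracked before.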

Some utility functions to examine the cache are:

 * `bfccache(bfc)`
 * `length(bfc)`
 * `show(bfc)`
 * `bfcinfo(bfc)`

`bfccache()` will show the cache path. **NOTE**: Because we are using temporary
directories, your path location will be different than shown.

```{r}
bfccache(bfc)
length(bfc)
```

`length()` on a `BiocFileCache` will show the number of files currently being
tracked by the `BiocFileCache`. For more detailed information on what is stored
in the `BiocFileCache` object, there is a show method which will display the
object, object class, cache path, and number of items currently being tracked.

```{r}
bfc
```

`bfcinfo()` will list a table of `BiocFileCache` resource files being tracked in
the cache. It returns a [dplyr][] object of class `tbl_sqlite`.

```{r}
bfcinfo(bfc)
```

The table of resource files includes the following information:

 * `rid`: resource id. Autogenerated. This is a unique identifier automatically
   generated when a resource is added to the cache.
 * `rname`: resource name. This is given by the user when a resource is added to
   the cache. It does not have to be unique and can be updated at any time. We
   recommend descriptive keywords and identifiers.
 * `create_time`: The date and time a resource is added to the cache.
 * `access_time`: The date and time a resource is utilized within the cache. The
   access time is updated when the resource is updated or downloaded.
 * `rpath`: resource path. This is the path to the local file.
 * `rtype`: resource type. Either "local" or "web", indicating if the resource
   has a remote origin.
 * `fpath`: If rtype is "web", this is the link to the remote resource. It will
   be utilized to download the remote data.
 * `last_modified_time`: For a remote resource, the last_modified (if available)
   information for the local copy of the data. This information is checked
   against the remote resource to determine if the local copy is stale and needs
   to be updated. If it is not available or your resource is not a remote
   resource, the last modified time will be marked as NA.
 * `etag`: For a remote resource, the etag (if available) information for the
   local copy of the data. This information is checked against the remote
   resource to determine if the local copy is stale and needs to be updated. If
   it is not available or your resource is not a remote resource, the etag will
   be marked as NA.
 * `expires`: For a remote resource, the expires (if available) information for
   the local copy of the data. This information is checked against the
   `Sys.time` to determine if the local copy needs to be updated. If it is not
   available or your resource is not a remote resource, the expires will be
   marked as NA.
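
Since `bfcinfo()` returns a table, individual columns can be pulled out with
ordinary data-frame tools. The following sketch uses a separate temporary cache
(`info_bfc`, a name chosen for illustration) and converts the result with
`as.data.frame()`; selecting with [dplyr][] verbs would work equally well.

```{r, eval=FALSE}
library(BiocFileCache)

## separate throwaway cache so the vignette's `bfc` is left untouched
info_bfc <- BiocFileCache(tempfile(), ask = FALSE)
src <- tempfile(fileext = ".txt")
writeLines("example", src)
bfcadd(info_bfc, "example text file", src)   # default action: copy

info <- as.data.frame(bfcinfo(info_bfc))
colnames(info)                               # the columns listed above
info[, c("rid", "rname", "rtype", "rpath")]  # a narrower view
```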

Now that we have created the cache object and location, let's explore adding
files that the cache will manage!

## Adding / Tracking Resources

Now that a `BiocFileCache` object and cache location have been created, files
can be added to the cache for tracking. There are two functions to add a
resource to the cache:

 * `bfcnew()`
 * `bfcadd()`

The difference between the options: `bfcnew()` creates an entry for a resource
and returns a filepath to save to. As there are many types of data that can be
saved in many different ways, `bfcnew()` allows you to save any _R_ data object
in the appropriate manner and still be able to track the saved file. `bfcadd()`
should be utilized when a file already exists or a remote resource is being
accessed.

`bfcnew()` takes the `BiocFileCache` object and a user-specified `rname` and
returns a path location to save data to. Optionally, you can add the file
extension if you know the type of file that will be saved:

```{r}
savepath <- bfcnew(bfc, "NewResource", ext=".RData")
savepath

## now we can use that path in any save function
m = matrix(1:12, nrow=3)
save(m, file=savepath)

## and that file will be tracked in the cache
bfcinfo(bfc)
```

`bfcadd()` is for existing files or remote resources. The user still specifies
an `rname` of their choosing but must also specify the path to a local file or
web resource as `fpath`. If no `fpath` is given, the default is to assume the
`rname` is also the path location. If the `fpath` is a local file, the `action`
argument determines how it is handled: `copy` copies the existing file into the
cache directory, `move` moves the existing file into the cache directory, and
`asis` leaves the file wherever it is on the local system while still tracking
it through the cache object. `copy` and `move` rename the file to the generated
cache file path. If the `fpath` is a remote source, `bfcadd()` attempts to
download it; if the download succeeds, the file is saved in the cache location
and tracked in the cache object, and the original source is recorded in the
cache information as `fpath`. If the user does not want the remote resource to
be downloaded initially, the argument `download=FALSE` may be used to delay the
download but still add the resource to the cache. Relative path locations may
also be used, specified with `rtype = "relative"`. This stores a relative
location for the file within the cache; only the actions `copy` and `move` are
available for relative paths.

First let's use local files:

```{r}
fl1 <- tempfile(); file.create(fl1)
add2 <- bfcadd(bfc, "Test_addCopy", fl1)                 # copy
# returns filepath being tracked in cache
add2
# the name is the unique rid in the cache
rid2 <- names(add2)

fl2 <- tempfile(); file.create(fl2)
add3 <- bfcadd(bfc, "Test2_addMove", fl2, action="move") # move
rid3 <- names(add3)

fl3 <- tempfile(); file.create(fl3)
add4 <- bfcadd(bfc, "Test3_addAsis", fl3, rtype="local",
	       action="asis") # reference
rid4 <- names(add4)

file.exists(fl1)    # TRUE - copied from original location
file.exists(fl2)    # FALSE - moved from original location
file.exists(fl3)    # TRUE - left asis, original location tracked
```
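
The relative resource type described earlier is not exercised by the calls
above. A minimal sketch, assuming a separate temporary cache (`rel_bfc`) and a
small text file (`src`), both names chosen for illustration:

```{r, eval=FALSE}
library(BiocFileCache)

rel_bfc <- BiocFileCache(tempfile(), ask = FALSE)
src <- tempfile(fileext = ".txt")
writeLines("hello", src)

## store the location relative to the cache directory; per the text
## above, only the actions "copy" and "move" are allowed here
rel <- bfcadd(rel_bfc, "relative example", src,
              rtype = "relative", action = "copy")
rel                      # path to the copy inside the cache
readLines(rel)
bfcinfo(rel_bfc)$rtype
```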

Now let's add some examples with remote sources:

```{r}
url <- "http://httpbin.org/get"
add5 <- bfcadd(bfc, "TestWeb", fpath=url)
rid5 <- names(add5)

url2<- "https://en.wikipedia.org/wiki/Bioconductor"
add6 <- bfcadd(bfc, "TestWeb", fpath=url2)
rid6 <- names(add6)

# add a remote resource but don't initially download
add7 <- bfcadd(bfc, "TestNoDweb", fpath=url2, download=FALSE)
rid7 <- names(add7)
# let's look at our BiocFileCache object now
bfc
bfcinfo(bfc)
```

Now that we are tracking resources, let's explore accessing their information!

## Investigating / Accessing Resources

Before we explore individual resources, we introduce a helper function. Most of
the functions provided require the unique rid[s] assigned to a resource.
`bfcadd()` and `bfcnew()` return the path as a named character vector; the name
of each element is the rid. However, you may want to access a resource that you
added some time ago.

 * `bfcquery()`

`bfcquery()` will take in a key word and search across the `rname`, `rpath`, and
`fpath` for any matching entries. The columns that are searched can be
controlled with the argument `field`.

```{r}
bfcquery(bfc, "Web")

bfcquery(bfc, "copy")

q1 <- bfcquery(bfc, "wiki")
q1
class(q1)
```

As you can see above, `bfcquery()` returns an object of class `tbl_sql` that can
be investigated further using methods for this class, such as those in the
`dplyr` package. The `rid` can be seen in the first column of the table and used
in other functions. To get a quick count of how many objects in the cache
matched the query, use `bfccount()`.

```{r}
bfccount(q1)
```
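
The `field` argument mentioned above narrows the search to particular columns.
The sketch below uses its own temporary cache (`q_bfc`) and two made-up resource
names; only `rname` is searched, so matches in `rpath` or `fpath` are ignored.

```{r, eval=FALSE}
library(BiocFileCache)

q_bfc <- BiocFileCache(tempfile(), ask = FALSE)
f <- tempfile()
file.create(f)
bfcadd(q_bfc, "alpha_counts", f)
bfcadd(q_bfc, "beta_counts", f)

## search the resource names only
hits <- bfcquery(q_bfc, "alpha", field = "rname")
bfccount(hits)           # number of matching entries
bfcrid(hits)             # rids to pass to other functions
```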

 * `[`

`[` allows for subsetting of the BiocFileCache object.  The output will be a
BiocFileSubCache object. Users will still be able to query, remove (from the
subset object only), and access resources of the subset, however the resources
cannot be updated.

```{r}
bfcsubWeb = bfc[paste0("BFC", 5:6)]
bfcsubWeb
bfcinfo(bfcsubWeb)
```

There are three methods for retrieving the `BiocFileCache` resource path
location.

 * `[[`
 * `bfcpath()`
 * `bfcrpath()`

The `[[` operator accesses the `rpath` saved in the `BiocFileCache`. It returns
the path to the local version of the resource, allowing the user to then use
this path in whichever load/read method is most appropriate for the resource.
`bfcpath()` and `bfcrpath()` both return a named character vector, also giving
the local file that can be used for retrieval. `bfcpath()` requires `rids`,
while `bfcrpath()` can use `rids` or `rnames` (but not both). `bfcrpath()` can
also add a resource to the cache when `rnames` are specified: if an element of
`rnames` is not found, it will try to add it to the cache with `bfcadd()`.


```{r}
bfc[["BFC2"]]
bfcpath(bfc, "BFC2")
bfcpath(bfc, "BFC5")
bfcrpath(bfc, rids="BFC5")
bfcrpath(bfc)
bfcrpath(bfc, c("http://httpbin.org/get","Test3_addAsis"))
```

Managing remote resources locally involves knowing when to update the local copy
of the data.

 * `bfcneedsupdate()`

`bfcneedsupdate()` compares the etag and last_modified time of the local copy of
the data with the etag and last_modified time of the remote resource, and also
checks an expires time. The cache saves this information when the web resource
is initially added. The expires time is checked against the current `Sys.time()`
to see if the local resource has expired. If so, the resource is deemed to need
an update; if the expires time is unavailable or has not passed, the etag and
last_modified time are checked. The etag information is used definitively if it
is available; if it is not available, the last_modified time is checked. If the
resource does not have a last_modified tag either, the result is undetermined
(`NA`). If the resource has not been downloaded yet, the result is `TRUE`.

**Note:** This function does not automatically download the remote source if it
  is out of date.  Please see `bfcdownload()`.

```{r}
bfcneedsupdate(bfc, "BFC5")
bfcneedsupdate(bfc, "BFC6")
bfcneedsupdate(bfc)
```
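
The order in which these checks are applied can be restated as a small function.
This is only an illustrative sketch of the rules described above, not the
package's internal code: `needsUpdateSketch()` and the list elements `etag`,
`last_modified`, and `expires` are hypothetical names, and treating a later
remote last_modified time as "stale" is an assumption.

```{r, eval=FALSE}
## Illustrative restatement of the rules in the text; NOT the package's
## implementation. `local` and `remote` are hypothetical lists.
needsUpdateSketch <- function(local, remote, downloaded = TRUE,
                              now = Sys.time())
{
    if (!downloaded)
        return(TRUE)                       # never downloaded
    if (!is.na(local$expires) && local$expires < now)
        return(TRUE)                       # local copy has expired
    if (!is.na(local$etag) && !is.na(remote$etag))
        return(local$etag != remote$etag)  # etag is decisive if present
    if (!is.na(local$last_modified) && !is.na(remote$last_modified))
        return(remote$last_modified > local$last_modified)
    NA                                     # cannot be determined
}

cached <- list(etag = "abc", last_modified = NA, expires = NA)
needsUpdateSketch(cached, list(etag = "abc", last_modified = NA))
needsUpdateSketch(cached, list(etag = "xyz", last_modified = NA))
needsUpdateSketch(cached, list(etag = NA, last_modified = NA))
```

The three calls cover a matching etag, a changed etag, and a remote resource
that reports neither an etag nor a last_modified time.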

## Updating Resource Entries or Local Copy of Remote Data

Just as you could access the `rpath`, the local resource path can be set with

 * `[[<-`

The file must exist in order to be replaced in the `BiocFileCache`. If the user
wishes to rename the file, they must first create the new file, either by
copying the existing file or by touching the new path.

```{r}
fileBeingReplaced <- bfc[[rid3]]
fileBeingReplaced

# fl3 was created when we were adding resources
fl3

bfc[[rid3]]<-fl3
bfc[[rid3]]
```

The user may also wish to change the `rname` or `fpath` associated with a
resource in addition to the `rpath`. This can be done with

 * `bfcupdate()`

Again, if changing the `rpath`, the file must exist. If an `fpath` is being
updated, the data will be downloaded and the user will be prompted to overwrite
the current file specified in `rpath`. If the user does not want to be prompted
about overwriting files, `ask=FALSE` may be used.

```{r}
bfcinfo(bfc, "BFC1")
bfcupdate(bfc, "BFC1", rname="FirstEntry")
bfcinfo(bfc, "BFC1")
```

Now let's update a web resource:

```{r}
suppressPackageStartupMessages({
    library(dplyr)
})
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
bfcupdate(bfc, "BFC6", fpath=url, rname="Duplicate", ask=FALSE)
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
```

Lastly, remote resources may require an update if the data is out of date (see
`bfcneedsupdate()`). The `bfcdownload()` function will attempt to download from
the original resource saved in the cache as `fpath` and overwrite the
out-of-date file at `rpath`.

 * `bfcdownload()`

The following confirms that the resource needs updating and then performs the
update:

```{r}
rid <- "BFC5"
test <- !identical(bfcneedsupdate(bfc, rid), FALSE) # 'TRUE' or 'NA'
if (test)
    bfcdownload(bfc, rid, ask=FALSE)
```

## Adding MetaData

The following functions are provided for metadata:

 * `bfcmeta()<-`
 * `bfcmeta()`
 * `bfcmetalist()`
 * `bfcmetaremove()`

Additional metadata can be added as `data.frames` that become tables in the sql
database. The `data.frame` must contain a column `rid` that matches the `rid`
column in the cache. Any metadata added will then be displayed when accessing
the cache. Metadata is added with `bfcmeta()<-`. A table `name` must be provided
as an argument. Users can add multiple metadata tables as long as the names are
unique. Tables may be appended or overwritten using additional arguments
`append=TRUE` or `overwrite=TRUE`.

```{r}
names(bfcinfo(bfc))
meta <- as.data.frame(list(rid=bfcrid(bfc)[1:3], idx=1:3))
bfcmeta(bfc, name="resourceData") <- meta
names(bfcinfo(bfc))
```
The metadata tables that exist can be listed with `bfcmetalist()` and can be
retrieved with `bfcmeta()`.

```{r}
bfcmetalist(bfc)
bfcmeta(bfc, name="resourceData")
```
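
The `append=TRUE` argument mentioned above is not shown in the preceding
chunks. The following sketch assumes a separate temporary cache (`m_bfc`) and a
made-up table name (`notes`); rows for a third resource are appended to a table
that initially describes only two.

```{r, eval=FALSE}
library(BiocFileCache)

m_bfc <- BiocFileCache(tempfile(), ask = FALSE)
for (nm in c("a", "b", "c"))
    bfcnew(m_bfc, nm)                 # three placeholder entries
rids <- bfcrid(m_bfc)

## create the metadata table with rows for the first two resources
bfcmeta(m_bfc, name = "notes") <-
    data.frame(rid = rids[1:2], note = c("first", "second"),
               stringsAsFactors = FALSE)

## append a row for the third resource to the existing table
bfcmeta(m_bfc, name = "notes", append = TRUE) <-
    data.frame(rid = rids[3], note = "third", stringsAsFactors = FALSE)

bfcmeta(m_bfc, name = "notes")
```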

Lastly, metadata can be removed with `bfcmetaremove()`.

```{r}
bfcmetaremove(bfc, name="resourceData")
```

**Note:**

Most functions have a convenience form: if no BiocFileCache object is specified,
they operate on the default cache `BiocFileCache()`. This option is not
available for `bfcmeta()<-`. A BiocFileCache object must always be assigned to a
variable first, and that variable passed to the function.

Example of ERROR:
```{r eval=FALSE}
bfcmeta(name="resourceData") <- meta
Error in bfcmeta(name = "resourceData") <- meta :
  target of assignment expands to non-language object
```
Correct implementation:
```{r eval=FALSE}
bfc <- BiocFileCache()
bfcmeta(bfc, name="resourceData") <- meta
```
All other functions have a default: if the BiocFileCache object is missing, they
operate on the default cache `BiocFileCache()`.

## Removing Resources

Now that we have added resources, it is also possible to remove a resource.

 * `bfcremove()`

When you remove a resource from the cache, the local file is also deleted, but
only if it is stored in the cache directory as given by `bfccache(bfc)`. If the
resource is a path to a file somewhere else on the user's system, it is only
removed from the `BiocFileCache` object and the file is not deleted.

```{r}
# let's remind ourselves of our object
bfc

bfcremove(bfc, "BFC6")
bfcremove(bfc, "BFC1")

# let's look at our BiocFileCache object now
bfc
```
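
The difference between files inside and outside the cache directory can be
checked directly. The sketch below assumes a separate temporary cache (`r_bfc`)
and two empty temporary files; all object names are illustrative.

```{r, eval=FALSE}
library(BiocFileCache)

r_bfc <- BiocFileCache(tempfile(), ask = FALSE)
inside <- tempfile();  file.create(inside)
outside <- tempfile(); file.create(outside)

## one file copied into the cache, one tracked where it already is
p_in <- bfcadd(r_bfc, "copied into cache", inside)
p_out <- bfcadd(r_bfc, "kept in place", outside,
                rtype = "local", action = "asis")

bfcremove(r_bfc, names(c(p_in, p_out)))

file.exists(p_in)       # the copy lived in the cache directory
file.exists(outside)    # this file lived elsewhere on the system
length(r_bfc)
```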

There is another helper function that may be of use:

 * `bfcsync()`

This function checks two things:

 1. If any `rpath` cannot be found (This would occur if `bfcnew()` is used and
    the path was not used to save an object)
 2. If there are files in the cache directory (`bfccache(bfc)`), that are not
    being tracked by the `BiocFileCache` object

```{r}
# create a new entry that hasn't been used
path <- bfcnew(bfc, "UseMe")
rmMe <- names(path)
# We also have a file not being tracked because we updated rpath

bfcsync(bfc)

# you can suppress the messages and just have a TRUE/FALSE
bfcsync(bfc, FALSE)

#
# Let's do some cleaning to have a synced object
#
bfcremove(bfc, rmMe)
unlink(fileBeingReplaced)

bfcsync(bfc)
```

## Exporting and Importing Cache

There is a helper function to export a BiocFileCache and associated files as a
tar or zip archive as well as the appropriate import function.

 * `exportbfc()`
 * `importbfc()`

The `exportbfc()` function takes a BiocFileCache object or subsetted object and
creates a tar or zip archive that can then be shared with collaborators on
different computer systems. The user can choose where the archive is created
with `outputFile`; by default, the archive is named `BiocFileCacheExport.tar`
and created in the current working directory. A tar archive is created by
default, but the user can create a zip archive instead using the argument
`outputMethod="zip"`. Any additional arguments to `utils::zip` or `utils::tar`
may also be used.

The following are some example calls:
```{r eval=FALSE}
# export entire biocfilecache
exportbfc(bfc)

# export the first 4 entries of biocfilecache
# as a compressed tar
exportbfc(bfc, rids=paste0("BFC", 1:4),
	  outputFile="BiocFileCacheExport.tar.gz", compression="gzip")

# export the subsetted object of web resources as zip
sub1 <- bfc[bfcrid(bfcquery(bfc, "web", field='rtype'))]
exportbfc(sub1, outputFile = "BiocFileCacheExportWeb.zip",
	  outputMethod="zip")
```

Once inflated on a user's system, the archive provides a fully functional copy
of the sent cache. The archive can be extracted manually and the path used in
the constructor `BiocFileCache()`, or for convenience the function `importbfc()`
may be used. The `importbfc()` function takes a path to the appropriate tar or
zip file, the argument `archiveMethod` indicating if `untar` or `unzip` should
be used (the default is `untar`), a path to where the archive should be
extracted as `exdir`, and any additional arguments to the `utils::untar` and
`utils::unzip` functions. The function will extract the files and load the
associated BiocFileCache object into the R session.

The following are example calls to load the above example exported objects:
```{r eval=FALSE}

bfc <- importbfc("BiocFileCacheExport.tar")

bfc2 <- importbfc("BiocFileCacheExport.tar.gz", compressed="gzip")

bfc3 <- importbfc("BiocFileCacheExportWeb.zip", archiveMethod="unzip")
```

## Creating a Cache from Existing Data

The following helper function converts existing data to a BiocFileCache:

 * `makeBiocFileCacheFromDataFrame`

This function may take a while to run if there are a lot of resources; however,
if the BiocFileCache is stored in a permanent location, it will only need to be
run once.

### Create a BiocFileCache from an Existing data.frame

`makeBiocFileCacheFromDataFrame` takes an existing data.frame and creates a
BiocFileCache object. The cache location can be specified by the `cache`
argument. The `cache` must not already exist and the user will be prompted to
create the location. If the user opts 'N', the cache will be created in a
temporary directory and this function will have to be run again upon a new R
session. The original data.frame must contain the required BiocFileCache columns
`rtype`, `rpath`, and `fpath` as described in the section 1.2 "Creating /
Loading the Cache". The optional columns `rname`, `last_modified_time`, `etag`
and `expires` may also be specified in the original data.frame although are not
required and will be populated with defaults if missing. For resources with
`rtype="local"`, `actionLocal` controls whether the local copy of the file is
copied or moved to the cache location, or left as is on the local system; a
local copy of the file must exist if the resource is identified as
`rtype="local"`. For resources with `rtype="web"`, `actionWeb` controls whether
the local copy of the remote file is copied or moved to the cache location. It is a
requirement of BiocFileCache that all remote resources download their local copy
to the cache location. A local copy of the file does not have to exist and can
be downloaded into the cache at a later time. Any additional columns of the
original data.frame besides those required or optional BiocFileCache columns,
are separated and added to the BiocFileCache as a meta data table with the name
given as `metadataName`. See section 1.6 on "Adding Metadata".

The following is an example data.frame with minimal columns 'rtype', 'rpath',
and 'fpath' and one additional column that will become metadata 'keywords'. The
'rpath' can be `NA` as these are remote resources (`rtype='web'`) that have not
been downloaded yet.

```{r}
tbl <- data.frame(rtype=c("web","web"),
                  rpath=c(NA_character_,NA_character_),
                  fpath=c("http://httpbin.org/get",
                          "https://en.wikipedia.org/wiki/Bioconductor"),
                  keywords = c("httpbin", "wiki"), stringsAsFactors=FALSE)
tbl
```

```{r eval=FALSE}

newbfc <- makeBiocFileCacheFromDataFrame(tbl,
					 cache=file.path(tempdir(),"BFC"),
					 actionWeb="copy",
					 actionLocal="copy",
					 metadataName="resourceMetadata")

```

## Cleaning or Removing Cache

Finally, there are two functions involved with cleaning or deleting the cache:

 * `cleanbfc()`
 * `removebfc()`

`cleanbfc()` will evaluate the resources in the `BiocFileCache` object and
determine which, if any, have not been created, redownloaded, or updated in a
specified number of days. If `ask=TRUE`, the user is asked, for each entry above
that threshold, whether it should be removed from the cache object and the file
deleted (the file is only deleted if it is in the `bfccache(bfc)` location). If
`ask=FALSE`, the function does not ask about each file and automatically removes
the entry and deletes the file. The default number of days is 120. If a resource
has not needed any updates, this function could give a false positive. It also
does not take into account how many times the resource was loaded by retrieving
the path (i.e., via `[[`, `bfcpath()`, or `bfcrpath()`), so it may not be an
accurate indication of how often the resource is utilized. Please use this
function with caution.

```{r eval=FALSE}
cleanbfc(bfc)
```
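
Before running `cleanbfc()`, it can be useful to preview which entries exceed
the threshold. The sketch below approximates that check from the `access_time`
column of `bfcinfo()`; it is an assumption about how the age could be computed,
not the package's own code, and `age_bfc`, `ageDays`, and `stale` are
illustrative names.

```{r, eval=FALSE}
library(BiocFileCache)

age_bfc <- BiocFileCache(tempfile(), ask = FALSE)
saveRDS(1:3, bfcnew(age_bfc, "fresh entry", ext = ".rds"))

info <- as.data.frame(bfcinfo(age_bfc))

## whole days since each entry was created, downloaded or updated
ageDays <- as.numeric(Sys.Date() - as.Date(as.character(info$access_time)))

## rids older than the default threshold of 120 days
stale <- info$rid[ageDays > 120]
stale
```

Inspecting `stale` first makes it easier to decide whether `ask=FALSE` is safe
for a particular cache.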

`removebfc()` will remove the `BiocFileCache` completely from the system. Any
files saved in the `bfccache(bfc)` directory will also be deleted.

```{r eval=FALSE}
removebfc(bfc)
```
**Note:** Use with caution!

# Use Cases

## Local cache of an internet resource

One use for [BiocFileCache][] is to save local copies of remote
resources. The benefits of this approach include reproducibility,
faster access, and access (once cached) without need for an internet
connection. An example is an Ensembl GTF file (also available via
[AnnotationHub][])

```{r}
## paste to avoid long line in vignette
url <- paste(
    "ftp://ftp.ensembl.org/pub/release-71/gtf",
    "homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
    sep="/")
```

For a system-wide cache, simply load the [BiocFileCache][] package and
ask for the local resource path (`rpath`) of the resource.

```{r, eval=FALSE}
library(BiocFileCache)
bfc <- BiocFileCache()
path <- bfcrpath(bfc, url)
```

Use the path returned by `bfcrpath()` as usual, e.g.,

```{r, eval=FALSE}
gtf <- rtracklayer::import.gff(path)
```

A more compact use, the first or any time, is

```{r, eval=FALSE}
gtf <- rtracklayer::import.gff(bfcrpath(BiocFileCache(), url))
```

Ensembl releases do not change with time, so there is no need to check
whether the cached resource needs to be updated.

## Cache of experimental computations

One might use [BiocFileCache][] to cache results from experimental
analysis. The `rname` field provides an opportunity to provide
descriptive metadata to help manage collections of resources, without
relying on cryptic file naming conventions.

Here we create or use a local file cache in the directory in which we are
doing our analysis.

```{r, eval=FALSE}
library(BiocFileCache)
bfc <- BiocFileCache("~/my-experiment/results")
```

We perform our analysis...

```{r, eval=FALSE}
suppressPackageStartupMessages({
    library(DESeq2)
    library(airway)
})
data(airway)
dds <- DESeqDataSet(airway, design = ~ cell + dex)
result <- DESeq(dds)
```

...and then save our result in a location provided by
[BiocFileCache][].

```{r, eval=FALSE}
saveRDS(result, bfcnew(bfc, "airway / DESeq standard analysis"))
```

Retrieve the result at a later date

```{r, eval=FALSE}
result <- readRDS(bfcrpath(bfc, "airway / DESeq standard analysis"))
```
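
The save-then-retrieve pattern above can be wrapped in a small helper so that an
expensive computation only runs when its result is not yet in the cache. This is
a sketch rather than part of the package: `cachedCompute()` and its arguments
are hypothetical, and it assumes each `rname` is used for one result only.

```{r, eval=FALSE}
library(BiocFileCache)

## hypothetical helper: compute once, then reuse the cached result
cachedCompute <- function(bfc, rname, compute)
{
    info <- bfcinfo(bfc)
    rid <- info$rid[info$rname == rname]
    if (length(rid) == 0L) {
        ## not cached yet: reserve a path, compute and save
        path <- bfcnew(bfc, rname, ext = ".rds")
        result <- compute()
        saveRDS(result, path)
        return(result)
    }
    ## cached: read the stored result back
    readRDS(bfcrpath(bfc, rids = rid[1]))
}

c_bfc <- BiocFileCache(tempfile(), ask = FALSE)
nCalls <- 0
slowSum <- function() { nCalls <<- nCalls + 1; sum(1:10) }

first <- cachedCompute(c_bfc, "sum of 1 to 10", slowSum)   # computes
second <- cachedCompute(c_bfc, "sum of 1 to 10", slowSum)  # reads cache
nCalls
```

Matching on the exact `rname` keeps the helper simple; a real analysis would
probably encode parameters or software versions in the name as well.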

One might imagine the following workflow:

```{r eval=FALSE}
suppressPackageStartupMessages({
    library(BiocFileCache)
    library(rtracklayer)
    library(dplyr)          # provides %>%, filter() and collect() used below
})

# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)

# the web resource of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"

# check if url is being tracked
res <- bfcquery(bfc, url)

if (bfccount(res) == 0L) {

    # if it is not in cache, add
    ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url)

} else {

  # if it is in cache, get path to load
  rid = res %>% filter(fpath == url) %>% collect(Inf) %>% `[[`("rid")
  ans <- bfcrpath(bfc, rid)

  # check to see if the resource needs to be updated
  check <- bfcneedsupdate(bfc, rid)
  # check can be NA if it cannot be determined, choose how to handle
  if (is.na(check)) check <- TRUE
  if (check){
    ans <- bfcdownload(bfc, rid)
  }
}

# ans is the path of the file to load
ans

# we know because we search for the url that the file is a .gtf.gz,
# if we searched on other terms we can use 'bfcpath' to see the
# original fpath to know the appropriate load/read/import method
bfcpath(bfc, names(ans))

temp = GTFFile(ans)
info = import(temp)
```

```{r eval=TRUE}

#
# A simpler way to test whether something is in the cache,
# and to start tracking it if not, is to use `bfcrpath`
#

suppressPackageStartupMessages({
    library(BiocFileCache)
    library(rtracklayer)
})

# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path, ask=FALSE)

# the web resources of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"

url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz"

# if not in cache will download and create new entry
pathsToLoad <- bfcrpath(bfc, c(url, url2))

pathsToLoad

# now load files as you see fit
info = import(GTFFile(pathsToLoad[1]))
class(info)
summary(info)
```

```{r eval=FALSE}
#
# One could also imagine the following:
#

library(BiocFileCache)

# load the cache
bfc <- BiocFileCache()

#
# Do some work!
#

# add a location in the cache
filepath <- bfcnew(bfc, "R workspace")

save(list = ls(), file=filepath)

# now the R workspace is being tracked in the cache
```

## Cache to manage package data

A package may desire to use BiocFileCache to manage remote data. The following
is example code providing some best practice guidelines.

1. Creating the cache

Presumably, the cache could be called in a variety of places within code,
examples, and vignettes. It is therefore desirable to have a wrapper around the
BiocFileCache constructor. The following is a suggested example for a package
called `MyNewPackage`:

```{r, eval=FALSE}
.get_cache <-
    function()
{
    cache <- rappdirs::user_cache_dir(appname="MyNewPackage")
    BiocFileCache::BiocFileCache(cache)
}
```
Essentially this will create a unique cache for the package. If run
interactively, the user will have the option to permanently create the package
cache, else a temporary directory will be used.

2. Resources in the cache

Managing remote resources then involves a function that queries the cache to see
if the resource has been added. If it has not, the function adds it to the
cache; if it has, the function checks whether the file needs to be updated.

```{r, eval=FALSE}
download_data_file <-
    function( verbose = FALSE )
{
    fileURL <- "http://a_path_to/someremotefile.tsv.gz"

    bfc <- .get_cache()
    rid <- bfcquery(bfc, "geneFileV2", "rname")$rid
    if (!length(rid)) {
	 if( verbose )
	     message( "Downloading GENE file" )
	 rid <- names(bfcadd(bfc, "geneFileV2", fileURL ))
    }
    if (!isFALSE(bfcneedsupdate(bfc, rid)))
	bfcdownload(bfc, rid)

    bfcrpath(bfc, rids = rid)
}
```

## Processing web resources before caching

In some cases it is desirable to process a web-based resource before
saving it in the cache. This can be done through specific options of
the `bfcadd()` and `bfcdownload()` functions.

1. Add the resource with `bfcadd()` using the `download=FALSE` argument.
2. Download the resource with `bfcdownload()` using the `FUN` argument.

The `FUN` argument is the name of a function to be applied before
saving the downloaded file into the cache.  The default is
`file.rename`, simply moving the downloaded file into the cache. A
user-supplied function must take ONLY two arguments. When invoked, the
arguments will be:

1. `character(1)` A temporary file containing the resource as
   retrieved from the web.
2. `character(1)` The BiocFileCache location where the processed file
   should be saved.

The function should return `TRUE` on success or a `character(1)`
description of the failure on error. As an example:

```{r}
url <- "http://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_stats.tab"

headFile <-                         # how to process file before caching
    function(from, to)
{
    dat <- readLines(from)
    writeLines(head(dat), to)
    TRUE
}

rid <- bfcquery(bfc, url, "fpath")$rid
if (!length(rid))                   # not in cache, add but do not download
    rid <- names(bfcadd(bfc, url, download = FALSE))

update <- bfcneedsupdate(bfc, rid)  # TRUE if newly added or stale
if (!isFALSE(update))               # download & process
    bfcdownload(bfc, rid, ask = FALSE, FUN = headFile)

rpath <- bfcrpath(bfc, rids=rid)    # path to processed result
readLines(rpath)                    # read processed result
```
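
A processing function can also report failure in the way described above. The
following sketch is a hypothetical second example, `gunzipFile()`, which stores
the decompressed text of a gzip-compressed download and returns the error
message as a `character(1)` if anything goes wrong. It is tried here on a local
file, outside of any cache.

```{r, eval=FALSE}
## hypothetical processing function: store the decompressed text
gunzipFile <- function(from, to)
{
    tryCatch({
        con <- gzfile(from, open = "rt")
        on.exit(close(con))
        writeLines(readLines(con), to)
        TRUE                               # success
    }, error = function(e) {
        conditionMessage(e)                # character(1) describing failure
    })
}

## try the function on its own with a small compressed file
gz <- tempfile(fileext = ".gz")
gzcon <- gzfile(gz, open = "wt")
writeLines(c("line 1", "line 2"), gzcon)
close(gzcon)

out <- tempfile()
status <- gunzipFile(gz, out)
status
readLines(out)
```

Within a cache, such a function would be passed in the same way as `headFile`
above, for example `bfcdownload(bfc, rid, ask = FALSE, FUN = gunzipFile)`.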

Note: By default `bfcadd()` uses the web file name as the name of the saved
local file. If the processing step involves saving the data in a different
format, use the `bfcadd()` argument `ext` to assign an extension that identifies
the type of file that was saved. For example:

```
url = "http://httpbin.org/get"
bfcadd(bfc, "myfile", url, download=FALSE)
# would save a file `<uniqueid>_get` in the cache
bfcadd(bfc, "myfile", url, download=FALSE, ext=".Rdata")
# would save a file `<uniqueid>_get.Rdata` in the cache
```


# Summary

It is our hope that this package allows for easier management of local and
remote resources.

[BiocFileCache]: https://bioconductor.org/packages/BiocFileCache
[dplyr]: https://cran.r-project.org/package=dplyr