% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dataset-scan.R
\name{Scanner}
\alias{Scanner}
\alias{ScannerBuilder}
\title{Scan the contents of a dataset}
\description{
A \code{Scanner} iterates over a \link{Dataset}'s fragments and returns data
according to given row filtering and column projection. A \code{ScannerBuilder}
can help create one.
}
\section{Factory}{

\code{Scanner$create()} wraps the \code{ScannerBuilder} interface to make a \code{Scanner}.
It takes the following arguments:
\itemize{
\item \code{dataset}: A \code{Dataset} or \code{arrow_dplyr_query} object, as returned by the
\code{dplyr} methods on \code{Dataset}.
\item \code{projection}: A character vector of column names to select, or a
named list of expressions.
\item \code{filter}: An \code{Expression} to filter the scanned rows by, or \code{TRUE} (default)
to keep all rows.
\item \code{use_threads}: logical: should scanning use multithreading? Default \code{TRUE}.
\item \code{...}: Additional arguments, currently ignored.
}
}

\section{Methods}{

\code{ScannerBuilder} has the following methods:
\itemize{
\item \verb{$Project(cols)}: Indicate that the scan should only return columns given
by \code{cols}, a character vector of column names or a named list of \link{Expression}.
\item \verb{$Filter(expr)}: Filter rows by an \link{Expression}.
\item \verb{$UseThreads(threads)}: logical: should the scan use multithreading?
The method's default input is \code{TRUE}, but you must call the method to enable
multithreading because the scanner default is \code{FALSE}.
\item \verb{$BatchSize(batch_size)}: integer: Maximum row count of scanned record
batches, default is 32K. If scanned record batches are overflowing memory
then this method can be called to reduce their size.
\item \verb{$schema}: Active binding, returns the \link{Schema} of the Dataset.
\item \verb{$Finish()}: Returns a \code{Scanner}.
}

\code{Scanner} provides \verb{$ToTable()}, which evaluates the query and returns an
Arrow \link{Table}, and \verb{$ToRecordBatchReader()}, which returns the results as a
\link{RecordBatchReader}.
}

\examples{
\dontshow{if (arrow_with_dataset() & arrow_with_parquet()) withAutoprint(\{ # examplesIf}
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
# tf is a directory, so unlink() needs recursive = TRUE to remove it
on.exit(unlink(tf, recursive = TRUE))

write_dataset(mtcars, tf, partitioning = "cyl")

ds <- open_dataset(tf)

scan_builder <- ds$NewScan()
scan_builder$Filter(Expression$field_ref("hp") > 100)
scan_builder$Project(list(hp_times_ten = 10 * Expression$field_ref("hp")))

# Once configured, call $Finish()
scanner <- scan_builder$Finish()

# Can get results as a table
as.data.frame(scanner$ToTable())

# Or as a RecordBatchReader
scanner$ToRecordBatchReader()
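
# A minimal sketch of the Scanner$create() shortcut described in the Factory
# section: it sets projection, filter, and threading in one call instead of
# going through ScannerBuilder. It reuses `ds` from above; the column choice
# and the names `scanner2` and `tab` are illustrative only.
scanner2 <- Scanner$create(
  ds,
  projection = c("hp", "mpg"),
  filter = Expression$field_ref("hp") > 100,
  use_threads = FALSE
)
tab <- scanner2$ToTable()
names(tab)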
\dontshow{\}) # examplesIf}
}