---
title: "Live streaming tweets"
subtitle: "rtweet: Collecting Twitter Data"
output:
  rmarkdown::html_vignette:
    fig_caption: true
    code_folding: show
    toc_float:
      collapsed: true
      toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Live streaming tweets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Installing and loading package

Prior to streaming, make sure to install and load rtweet.
This vignette assumes users have already setup app access tokens (see: the "auth" vignette, `vignette("auth", package = "rtweet")`).

```r
## Load rtweet
library(rtweet)
client_as("my_app")
```

## Overview

rtweet makes it possible to capture live streams of Twitter data[^1]. 

[^1]: Until November 2022 streaming was possible with API v1.1; that is no longer the case, and rtweet now uses API v2.

There are two ways of having a stream:

-   [A stream collecting data from a set of rules](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream), which can be collected via `filtered_stream()`.

-   [A stream of a 1% of tweets published](https://developer.twitter.com/en/docs/twitter-api/tweets/volume-streams/api-reference/get-tweets-sample-stream), which can be collected via `sample_stream()`.

In either case we need to choose how long the streaming connection should be held open and which file the data should be saved to.


```r
## Stream time in seconds so for one minute set timeout = 60
## For larger chunks of time, I recommend multiplying 60 by the number
## of desired minutes. This method scales up to hours as well
## (x * 60 = x mins, x * 60 * 60 = x hours)
## Stream for 5 seconds
streamtime <- 5
## Filename to save json data (backup)
filename <- "rstats.json"
```

## Filtered stream

The filtered stream collects tweets for all rules that are currently active, not just one rule or query.

### Creating rules

Streaming rules in rtweet need a value and a tag.
The value is the query to be performed, and the tag is the name to identify tweets that match a query.
You can use multiple words and hashtags as value, please [read the official documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule).
Multiple rules can match to a single tweet.


```r
## Stream rules used to filter tweets
new_rule <- stream_add_rule(list(value = "#rstats", tag = "rstats"))
```
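Rule values can also combine several terms with operators. As a hedged sketch (the available operators are defined by Twitter's rule syntax, linked above; the specific value and tag here are invented for illustration), a rule matching either of two hashtags might look like:

```r
## Hypothetical rule: match tweets containing either hashtag,
## grouped under a single tag (operator syntax per Twitter's
## filtered-stream rule documentation)
hashtag_rule <- stream_add_rule(
  list(value = "#rstats OR #rdatascience", tag = "r-hashtags")
)
```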

### Listing rules

To list the current rules, call `stream_add_rule()` with `NULL` to check whether any rule is currently active:


```r
rules <- stream_add_rule(NULL)
rules
#>   result_count                sent
#> 1            1 2023-03-19 22:04:29
rules(rules)
#>                    id   value    tag
#> 1 1637575790693842952 #rstats rstats
```

The `rules()` helper extracts the id, value, and tag of each rule.

### Removing rules

To remove rules, use `stream_rm_rule()`:


```r
# Not evaluated now
stream_rm_rule(ids(new_rule))
```

Note that if rules go unused for some time, Twitter warns you that they will be removed.
And because `filtered_stream()` collects tweets for all active rules, it is advisable to keep the rules list short and clean.

### filtered_stream()

Once these parameters are specified, initiate the stream.
Note: barring any disconnection or disruption of the API, streaming will occupy your current instance of R until the specified time has elapsed.
It is possible to start a new instance of R (streaming itself usually isn't very memory intensive), but operations may drag a bit during the parsing process, which takes place immediately after streaming ends.


```r
## Stream #rstats tweets
stream_rstats <- filtered_stream(timeout = streamtime, file = filename, parse = FALSE)
#> Warning: No matching tweets with streaming rules were found in the time provided.
```

If no tweet matching the rules is detected a warning will be issued.

Parsing larger streams can take considerable time (in addition to the time spent streaming) because of the simplifying process used to convert a json file into an R object.

Don't forget to clean the streaming rules:


```r
stream_rm_rule(ids(new_rule))
#>                  sent deleted not_deleted
#> 1 2023-03-19 22:04:51       1           0
```

## Sample stream

The `sample_stream()` function does not require any rules:


```r
stream_random <- sample_stream(timeout = streamtime, file = filename, parse = FALSE)
#> Found 316 records...
#> Imported 316 records. Simplifying...
length(stream_random)
#> [1] 316
```


## Saving files

Users may want to stream tweets into json files upfront and parse those files later on.
To do this, simply add `parse = FALSE` and make sure you provide a path (file name) to a location you can find later.

You can also use `append = TRUE` to continue recording a stream into an already existing file.

Parsing the streamed data file with `parse_stream()` is currently not functional.
However, you can read it back in with `jsonlite::stream_in(file)`.
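Since the streamed file is newline-delimited JSON, reading it back can be sketched with a self-contained toy file standing in for real `filtered_stream()`/`sample_stream()` output (the field names below are invented for illustration):

```r
library(jsonlite)

## Write two toy NDJSON records to a temporary file,
## standing in for a file produced by a streaming call
tmp <- tempfile(fileext = ".json")
writeLines(c('{"id":"1","text":"hello #rstats"}',
             '{"id":"2","text":"more tweets"}'), tmp)

## stream_in() expects a connection and returns a data frame,
## one row per JSON record
tweets <- stream_in(file(tmp))
nrow(tweets)
#> [1] 2
```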

## Returned data object

The parsed object should be the same whether a user parses up-front or from a json file in a later session.

Currently the returned object is a raw conversion of the feed into a nested list, depending on the fields and expansions requested.