---
title: "Live streaming tweets"
subtitle: "rtweet: Collecting Twitter Data"
output:
  rmarkdown::html_vignette:
    fig_caption: true
    code_folding: show
    toc_float:
      collapsed: true
      toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Live streaming tweets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Installing and loading package

Prior to streaming, make sure to install and load rtweet.
This vignette assumes users have already setup app access tokens (see: the "auth" vignette, `vignette("auth", package = "rtweet")`).

```r
## Load rtweet
library(rtweet)
client_as("my_app")
```

## Overview

rtweet makes it possible to capture live streams of Twitter data[^1]. 

[^1]: Until November 2022 streaming was possible with API v1.1; that is no longer the case, and rtweet now uses API v2.

There are two ways of having a stream:

-   [A stream collecting data from a set of rules](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/api-reference/get-tweets-search-stream), which can be collected via `filtered_stream()`.

-   [A stream of a 1% of tweets published](https://developer.twitter.com/en/docs/twitter-api/tweets/volume-streams/api-reference/get-tweets-sample-stream), which can be collected via `sample_stream()`.

In either case we need to choose how long the streaming connection should be held open and which file the data should be saved to.


```r
## Stream time in seconds so for one minute set timeout = 60
## For larger chunks of time, I recommend multiplying 60 by the number
## of desired minutes. This method scales up to hours as well
## (x * 60 = x mins, x * 60 * 60 = x hours)
## Stream for 5 seconds
streamtime <- 5
## Filename to save json data (backup)
filename <- "rstats.json"
```

## Filtered stream

The filtered stream collects tweets for all rules that are currently active, not just one rule or query.

### Creating rules

Streaming rules in rtweet need a value and a tag.
The value is the query to be performed, and the tag is the name to identify tweets that match a query.
You can use multiple words and hashtags as value, please [read the official documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule).
Multiple rules can match to a single tweet.


```r
## Stream rules used to filter tweets
new_rule <- stream_add_rule(list(value = "#rstats", tag = "rstats"))
```
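Rule values can also combine several terms with operators. As a hedged sketch (the available operators are defined by Twitter's rule syntax, linked above; the specific value and tag here are invented for illustration), a rule matching either of two hashtags might look like:

```r
## Hypothetical rule: match tweets containing either hashtag,
## grouped under a single tag (operator syntax per Twitter's
## filtered-stream rule documentation)
hashtag_rule <- stream_add_rule(
  list(value = "#rstats OR #rdatascience", tag = "r-hashtags")
)
```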

### Listing rules

To list the current rules, call `stream_add_rule()` with `NULL` to check whether any rule is currently active:


```r
rules <- stream_add_rule(NULL)
rules
#>   result_count                sent
#> 1            1 2023-03-19 22:04:29
rules(rules)
#>                    id   value    tag
#> 1 1637575790693842952 #rstats rstats
```

The `rules()` helper extracts the id, value, and tag of each rule.

### Removing rules

To remove rules, use `stream_rm_rule()`:


```r
# Not evaluated now
stream_rm_rule(ids(new_rule))
```

Note that if rules go unused for some time, Twitter warns you that they will be removed.
And because `filtered_stream()` collects tweets for all active rules, it is advisable to keep the rules list short and clean.

### filtered_stream()

Once these parameters are specified, initiate the stream.
Note: barring any disconnection or disruption of the API, streaming will occupy your current instance of R until the specified time has elapsed.
It is possible to start a new instance of R (streaming itself usually isn't very memory intensive), but operations may drag a bit during the parsing process, which takes place immediately after streaming ends.


```r
## Stream #rstats tweets
stream_rstats <- filtered_stream(timeout = streamtime, file = filename, parse = FALSE)
#> Warning: No matching tweets with streaming rules were found in the time provided.
```

If no tweet matching the rules is detected a warning will be issued.

Parsing larger streams can take considerable time (in addition to the time spent streaming) because of the simplifying process used to convert a json file into an R object.

Don't forget to clean the streaming rules:


```r
stream_rm_rule(ids(new_rule))
#>                  sent deleted not_deleted
#> 1 2023-03-19 22:04:51       1           0
```

## Sample stream

The `sample_stream()` function does not require any rules:


```r
stream_random <- sample_stream(timeout = streamtime, file = filename, parse = FALSE)
#> Found 316 records...
#> Imported 316 records. Simplifying...
length(stream_random)
#> [1] 316
```


## Saving files

Users may want to stream tweets into json files upfront and parse those files later on.
To do this, simply add `parse = FALSE` and make sure you provide a path (file name) to a location you can find later.

You can also use `append = TRUE` to continue recording a stream into an already existing file.

Parsing the streamed data file with `parse_stream()` is currently not functional.
However, you can read it back in with `jsonlite::stream_in(file)`.
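Since the streamed file is newline-delimited JSON, reading it back can be sketched with a self-contained toy file standing in for real `filtered_stream()`/`sample_stream()` output (the field names below are invented for illustration):

```r
library(jsonlite)

## Write two toy NDJSON records to a temporary file,
## standing in for a file produced by a streaming call
tmp <- tempfile(fileext = ".json")
writeLines(c('{"id":"1","text":"hello #rstats"}',
             '{"id":"2","text":"more tweets"}'), tmp)

## stream_in() expects a connection and returns a data frame,
## one row per JSON record
tweets <- stream_in(file(tmp))
nrow(tweets)
#> [1] 2
```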

## Returned data object

The parsed object should be the same whether a user parses up-front or from a json file in a later session.

Currently the returned object is a raw conversion of the feed into a nested list, depending on the fields and expansions requested.