1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171
|
# Monitoring
The most requested feature is visibility into the validation data.
This document covers basic monitoring and health of the toolings
(OctoRPKI and GoRTR) but also advanced features like distributed tracing
and extended logging systems.
While all the tools are optional, we recommend to setup mininmal
monitoring with Prometheus.
The sections below will go from a simple general use-case to
more specific, development-centric data.
You can usually find the tools listed here in use inside many tech companies.
While it may feel superfluous and complex to set them up _just_
for the use of RPKI, it may be fruitful to reach out to development-focused
teams and use previously installed software.
A quick note on the Cloudflare RPKI Dashboard and API as it also
fits in the *monitoring* part of RPKI.
A custom version of OctoRPKI is providing the data behind
[rpki.cloudflare.com](https://rpki.cloudflare.com) and its GraphQL API.
It includes fingerprints, file-specific information, historical data
and also validation status of the BGP data collected.
Unfortunately, the setup is too specific to Cloudflare to be made open-source.
Work is being done in that direction to provide a limited feature-set plug and play
dashboard. A section will be added later on.
## Play with docker-compose
In the `compose` folder, there is a configuration file that can be used
to start a RPKI validation environment with monitoring.
Make sure [docker](https://docs.docker.com/get-docker/) and
[docker-compose](https://docs.docker.com/compose/install/) are installed on your machine.
This should provide an effortless demo of all the pieces fitting together:
- 2 GoRTR (one connected to the local OctoRPKI)
- 1 OctoRPKI
- 1 Prometheus
- 1 Grafana (provisions RPKI dashboard)
- 1 Jaeger
You can start with `docker-compose up`.
The Grafana is available on http://localhost:3000 (user/pass: admin/admin).
Jaeger interface is on http://localhost:16686.
You can connect RTR clients to localhost:8282 and localhost:8283.
## Graphs with Prometheus and Grafana
[Prometheus](https://prometheus.io/docs/introduction/overview/#what-is-prometheus) is an
open-source monitoring and alerting system widely used in the devops community.
Its configuration file indicates HTTP endpoints that Prometheus will scrape on a periodic
basis.
Both OctoRPKI and GoRTR have Prometheus-scrappable endpoints.
If you are running them on your local machine:
* GoRTR http://localhost:8080/metrics
* OctoRPKI http://localhost:8081/metrics
If you look at the data returned, you can find metrics like:
```
rpki_roas{filtered="filtered",ip_version="ipv4",path="https://rpki.cloudflare.com/rpki.json"} 132204
rpki_roas{filtered="filtered",ip_version="ipv6",path="https://rpki.cloudflare.com/rpki.json"} 21923
```
The metric name is `rpki_roas`, the labels are `filtered,ip_version,path` and the value is the last number.
Once Prometheus has scraped it, it's inserted in its database timestampped with the query time.
You can access the data from Grafana using the Prometheus data source.
Grafana is an open-source visualization and analytics software.
The pre-made [dashboard 12501](https://grafana.com/grafana/dashboards/12501)
can be imported and used .
<p align="center">
<img src="resources/monitoring_grafana.png" alt="OctoRPKI Grafana Dashboard" width="600px"/>
</p>
## Error logging with Sentry
[Sentry](https://sentry.io/) is an open-source application that provides
a error and events monitoring. It is also available as a cloud-service.
The advantage of this tool is to provide an interface and search engine for logs.
Stacktrace, URIs or even tags can be used to enrich log events and help sorting and grouping.
It is more flexible than exploring console stdout/stderr messages.
This was shown to be particularly useful when investigating a cryptographically invalid resource
and troubleshoot reachability issues.
It requires Sentry-specific software code in order to provide more information.
OctoRPKI is compatible and uses the official library that wrap the errors.
When a validation error happens, information like file path and certificate key id are
added to the log before being sent to Sentry.
By passing the environment variable `SENTRY_DSN=https://<key>@<organization>.<server>/<project>`,
or the CLI argument `-sentry.dsn https://...` OctoRPKI will connect to the Sentry instance and send its messages.
It alsos include validation failures and fetching informmation (RRDP, rsync).
<p align="center">
<img src="resources/monitoring_sentry_1.png" alt="OctoRPKI Sentry Dashboard Events List" width="600px"/>
</p>
<p align="center">
<img src="resources/monitoring_sentry_2.png" alt="OctoRPKI Sentry Dashboard Detail Page" width="600px"/>
</p>
If you are new to Sentry, to get started, you can setup
[sentry/onpremise](https://github.com/getsentry/onpremise),
which uses docker-compose.
Another solution is to create a account on [sentry.io](https://sentry.io/pricing/).
The free/developer account should allow you to run a validator
within the quotas.
Proceed by creating a Project and obtain DSN to pass to the application.
You can use it with the docker-compose provided:
`SENTRY_DSN=https://... docker-compose up`.
## Distributed tracing with Jaeger
[Distributed tracing](https://opentracing.io/docs/overview/what-is-tracing/)
allows visualization of events with relational graphs and waterfal charts.
It is heavily used for microservices and complex distributed environments.
While a RPKI validator is a monolithic application, it fetches data from
many endpoints. Timing visualizations can help discovering issues
and possible optimizations.
The tracer front-end and library used in this project is
[Jaeger](https://www.jaegertracing.io/), an application developed by Uber.
To enable tracing, pass the flag `-trace=true` to enable Jaeger tracing.
The following [environment variables](https://github.com/jaegertracing/jaeger-client-go#environment-variables)
are required:
- `JAEGER_ENDPOINT=http://jaeger:14268/api/traces`
- `JAEGER_SERVICE_NAME=octorpki`
- `JAEGER_SAMPLER_TYPE=const`
- `JAEGER_SAMPLER_PARAM=1`
- `JAEGER_REPORTER_LOG_SPANS=true`
Once you connect to the dashboard, you will be able to see the status of the validation
and steps/iterations with errors.
<p align="center">
<img src="resources/monitoring_tracing.png" alt="OctoRPKI Jaeger Distributed Tracing" width="600px"/>
</p>
_Please note that some installations may use a specific flavor of open-tracing (eg: different protocols).
Some code changes may be required in order to be made compatible. This is unfortunately not possible
with configuration flags._
## Profiling usage with Pprof
This last part is more focused on the software development than proper operational
monitoring. It can be helpful identifying an issue with the code.
Profiling gives information about resource usage per function calls.
To enable profiling, pass the CLI argument `-pprof=true`.
OctoRPKI web interface will now provide new information on http://localhost:8081/debug/pprof/.
Use `go tool pprof` to connect remotely and open a web interface with charts
```bash
$ go tool pprof -http :8084 http://localhost:8081/debug/pprof/profile
```
<p align="center">
<img src="resources/monitoring_pprof_1.png" alt="OctoRPKI Pprof Heap Memory Graph" width="600px"/>
</p>
<p align="center">
<img src="resources/monitoring_pprof_2.png" alt="OctoRPKI Pprof CPU Flame Graph" width="600px"/>
</p>
|