# Spamcheck
## Project Structure
Spamcheck started as an internal project by and for GitLab. Over time, it became increasingly clear that the community could benefit from these efforts, and the decision was made to make much of it public.
At the moment, management and development (issues, merge requests, etc.) happen in [the gitlab-com Spamcheck project](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck), with [the gitlab-org project](https://gitlab.com/gitlab-org/spamcheck) serving as a code and registry mirror.
## Architecture Diagram
The following diagram gives a high-level overview of how the various components of the anti-spam engine interact with each other:

The basic spamcheck workflow is as follows. If issue metadata meets certain predefined conditions, a spam verdict is determined by custom business logic. If the spam status cannot be determined from issue metadata, ML inference is performed on the issue and the confidence ratio of the classification is used to determine the spam verdict.
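
The two-stage workflow can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Verdict` ordering, the allow-list rule, the `user_email` field, and the confidence thresholds are all assumptions made up for the example.

```python
from enum import IntEnum
from typing import Callable, Optional


class Verdict(IntEnum):
    # Assumed ordering for illustration; the real enum is defined in the protobuf files.
    ALLOW = 0
    CONDITIONAL_ALLOW = 1
    DISALLOW = 2
    BLOCK = 3


def verdict_from_metadata(issue: dict) -> Optional[Verdict]:
    """Return a verdict when metadata alone is decisive, otherwise None."""
    # Hypothetical business rule: always allow users from a trusted domain.
    if issue.get("user_email", "").endswith("@example.com"):
        return Verdict.ALLOW
    return None


def verdict_from_confidence(confidence: float) -> Verdict:
    """Map a spam probability to a verdict using made-up thresholds."""
    if confidence >= 0.9:
        return Verdict.BLOCK
    if confidence >= 0.5:
        return Verdict.CONDITIONAL_ALLOW
    return Verdict.ALLOW


def check_issue(issue: dict, score: Callable[[dict], float]) -> Verdict:
    """Try metadata rules first and fall back to ML inference."""
    verdict = verdict_from_metadata(issue)
    if verdict is not None:
        return verdict
    return verdict_from_confidence(score(issue))
```

Passing the scoring function in as an argument keeps the business rules testable without loading a model.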

## Development
### Running spamcheck with GDK
1. [Log in to the GitLab Container Registry](https://docs.gitlab.com/ee/user/packages/container_registry/authenticate_with_container_registry.html):
   ```bash
   docker login registry.gitlab.com -u <username> -p <token>
   ```
1. Run the service locally with Docker:
   ```bash
   docker pull registry.gitlab.com/gitlab-org/gl-security/security-engineering/security-automation/spam/spamcheck
   docker run --rm -p 8001:8001 registry.gitlab.com/gitlab-org/gl-security/security-engineering/security-automation/spam/spamcheck
   ```
1. [Configure reCAPTCHA](https://docs.gitlab.com/ee/development/spam_protection_and_captcha/exploratory_testing.html) in GDK.
1. Enable Spamcheck in GDK:
   1. Go to Admin -> Settings -> Reporting.
   1. Go to the Spam and Anti-bot Protection section.
   1. Check `Enable reCAPTCHA`.
   1. Check `Enable Spam Check via external API endpoint`.
   1. Set URL of the external Spam Check endpoint to `grpc://localhost:8001`.
1. To change the maximum verdict values of spamcheck, use the `--max-[TYPE]-verdict` options:
   ```bash
   docker run --rm -p 8001:8001 registry.gitlab.com/gitlab-org/gl-security/security-engineering/security-automation/spam/spamcheck --max-generic-verdict block
   ```
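
As a rough illustration of what a maximum verdict means, the sketch below caps a computed verdict at a configured ceiling. The severity order in `VERDICT_ORDER`, the case-insensitive handling, and the `cap_verdict` helper are assumptions for this example and may differ from how spamcheck implements the option.

```python
# Assumed severity order, from least to most severe.
VERDICT_ORDER = ["ALLOW", "CONDITIONAL_ALLOW", "DISALLOW", "BLOCK"]


def cap_verdict(verdict: str, max_verdict: str) -> str:
    """Return the verdict, lowered to max_verdict when it is more severe."""
    verdict, max_verdict = verdict.upper(), max_verdict.upper()
    if VERDICT_ORDER.index(verdict) > VERDICT_ORDER.index(max_verdict):
        return max_verdict
    return verdict


# A BLOCK result is reduced when the ceiling is CONDITIONAL_ALLOW.
print(cap_verdict("BLOCK", "conditional_allow"))
# With a ceiling of block, nothing is reduced.
print(cap_verdict("DISALLOW", "block"))
```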
### Development Environment
Clone this repo, install dependencies, and run the service. To perform ML inference, ensure a classifier is available in the `./classifiers` directory. See the ML Classifiers and Configuration sections for details.
```bash
git clone git@gitlab.com:gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck.git
cd spamcheck
make deps
cp config/config.example.yml config/config.yml
# Customize config/config.yml if necessary
make run
```
#### Generating gRPC protobuf files
To build the protobuf files when you've made a change:
```bash
make proto
```
To build the Ruby protobuf files:
```bash
make proto_ruby
```
### Local development using JupyterLab
As an alternative, we've created a development environment (using a [Jupyter Docker Stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html) image) that offers a containerized JupyterLab interface for writing and running Python files and Jupyter notebooks.
To set this up, follow the instructions [here](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/automation-team/docker/jupyter/-/blob/main/README.md).
### Running in Docker
Build the `spamcheck` docker image.
```bash
docker build -t spamcheck-py .
```
Run the docker image and ensure model files are mounted.
```bash
docker run --rm -p 8001:8001 -v "$(pwd)/classifiers:/spamcheck/classifiers" spamcheck-py
```
## ML Classifiers
ML classifiers are Python modules that classify spam and return a score between 0.0 and 1.0, which corresponds to the probability that a given issue is spam.
All classifiers should exist as a module associated with the type of data to be classified. For example, code to classify issues should live in the `./classifiers/issue` directory. Each classifier must contain, at a minimum, a `classifier` module that includes a `score` method. The `score` method receives a dictionary representation of the object to classify and returns a float. The directory structure of an issue classifier may look like this:
```
classifiers/
|-- issue/
| |--model/
| | |-- model.tflite
| | |-- tokenizer.pickle
| |--classifier.py
| |--pre_processor.py
```
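
To make the `score` contract concrete, here is a minimal sketch of what a `classifier.py` could look like. It uses a simple keyword heuristic as a stand-in for a real `tflite` model; the `title` and `description` keys and the `SPAM_HINTS` list are assumptions for illustration, not the actual issue schema.

```python
"""Hypothetical classifiers/issue/classifier.py with a keyword heuristic instead of a model."""

# Made-up phrases used only for this example.
SPAM_HINTS = ("free money", "click here", "casino")


def score(issue: dict) -> float:
    """Return a spam probability between 0.0 and 1.0 for an issue dictionary."""
    # Assumed field names; missing keys are treated as empty text.
    text = f"{issue.get('title', '')} {issue.get('description', '')}".lower()
    hits = sum(hint in text for hint in SPAM_HINTS)
    # The fraction of matched hints keeps the result inside [0.0, 1.0].
    return min(1.0, hits / len(SPAM_HINTS))
```

A real classifier would load its model and tokenizer once at import time and run inference inside `score`, but the input and output types stay the same.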
The ML classifiers are constrained by the dependencies included in this repository's `Pipfile`, and care should be taken not to build classifiers that need additional dependencies. This implies that the model used by the classifier is in the `tflite` format. The `docker build` command will automatically install the latest classifier.
```bash
docker build -t spamcheck-py .
docker run --rm -v "$(pwd)/config:/spamcheck/config" -p8001:8001 spamcheck-py
```
## Configuration
Configuration options can be loaded via a `yaml` config file, environment variables, or CLI arguments. Options are resolved in the following order of precedence, highest first:
1. CLI arguments
2. Environment variables
3. Config file options
4. Application defaults
|Option|Default Value|Description|
|------|--------------|------------|
|`-c`, `--config`<br/>`SPAMCHECK_CONFIG` | `./config/config.yml` | The path to the spamcheck configuration file. Must be in `yaml` format. |
|`--env`<br/>`SPAMCHECK_ENV` | `development`| The environment spamcheck is running in. When running in production gRPC reflection is disabled. |
|`--gcs-bucket`<br/>`SPAMCHECK_GCS_BUCKET` | `None` | The GCS bucket to store unlabeled spam for future labeling and training. |
|`--google-pubsub-project`<br/>`SPAMCHECK_GOOGLE_PUBSUB_PROJECT` | `None` | The GCP project where the spamcheck PubSub topic resides for publishing spam events. |
|`--google-pubsub-topic`<br/>`SPAMCHECK_GOOGLE_PUBSUB_TOPIC` | `spamcheck` | The GCP PubSub topic to publish spam events to. |
|`--grpc-addr`<br/>`SPAMCHECK_GRPC_ADDR` | `0.0.0.0:8001` | The `HOST:PORT` to bind the spamcheck service to. |
|`--log-level`<br/>`SPAMCHECK_LOG_LEVEL` | `info` | The application log level. |
|`--max-generic-verdict`<br/>`SPAMCHECK_MAX_GENERIC_VERDICT` | `ALLOW` | Maximum verdict to return for generic spammables. |
|`--max-issue-verdict`<br/>`SPAMCHECK_MAX_ISSUE_VERDICT` | `CONDITIONAL_ALLOW` | Maximum verdict to return for issue spammables. |
|`--max-snippet-verdict`<br/>`SPAMCHECK_MAX_SNIPPET_VERDICT` | `CONDITIONAL_ALLOW` | Maximum verdict to return for snippet spammables. |
|`--ml-classifiers`<br/>`SPAMCHECK_ML_CLASSIFIERS` | `./classifiers` | Directory location for ML classifiers. |
|`--tls-certificate`<br/>`SPAMCHECK_TLS_CERTIFICATE` | `./ssl/cert.pem` | The path to the TLS certificate to use for secure connections to the spamcheck service. |
|`--tls-private-key`<br/>`SPAMCHECK_TLS_PRIVATE_KEY` | `./ssl/key.pem` | The path to the TLS private key to use for secure connections to the spamcheck service. |
`config/config.yml` has more configuration options. View the [example file](./config/config.example.yml) for details.
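
The precedence order described above can be sketched as a simple lookup chain. This is an illustrative assumption about how such resolution might work, not spamcheck's actual loader. The `resolve` helper and the config-file key names (`grpc_addr`, `log_level`) are hypothetical, and the environment is passed in as a plain dictionary to keep the example deterministic.

```python
import argparse

# Application defaults for two of the options in the table above.
DEFAULTS = {"grpc_addr": "0.0.0.0:8001", "log_level": "info"}


def resolve(option, cli_args, env, file_config):
    """Return the first value found: CLI, environment, config file, then default."""
    cli_value = getattr(cli_args, option, None)
    if cli_value is not None:
        return cli_value
    env_value = env.get(f"SPAMCHECK_{option.upper()}")
    if env_value is not None:
        return env_value
    if option in file_config:
        return file_config[option]
    return DEFAULTS[option]


parser = argparse.ArgumentParser()
parser.add_argument("--grpc-addr")
parser.add_argument("--log-level")
# Simulate running with only --log-level given on the command line.
args = parser.parse_args(["--log-level", "debug"])

env = {"SPAMCHECK_LOG_LEVEL": "warning", "SPAMCHECK_GRPC_ADDR": "127.0.0.1:9001"}
file_config = {"grpc_addr": "0.0.0.0:7001"}  # hypothetical parsed config.yml

print(resolve("log_level", args, env, file_config))  # CLI value wins over the environment
print(resolve("grpc_addr", args, env, file_config))  # environment wins over the config file
```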
## Concept overview
## Linting
Run pylint against spamcheck source code:
```bash
make lint
```
## Testing
Tests are located in the `./tests` directory and mirror the Python module layout.
### Test suite
Run the test suite:
```bash
make test
```
### Manually
Start the service via `make run`.
Test the gRPC endpoint with a Python gRPC client:
```bash
python client.py
```
Test the gRPC endpoint with [grpcurl](https://github.com/fullstorydev/grpcurl):
```bash
grpcurl -plaintext -d "$(cat examples/checkforspamissue.json)" localhost:8001 spamcheck.SpamcheckService/CheckForSpamIssue
```
By default, `grpcurl` returns an empty object (`{}`) because Protobuf does not encode zero-valued fields, and `SpamVerdict.Verdict` is an enum whose default value, `ALLOW`, is 0.
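
The effect can be mimicked in plain Python. The sketch below does not use the protobuf library; it only imitates the rule that fields holding the zero value are left out of the JSON output. The `to_json` helper and the number assigned to `BLOCK` are assumptions for illustration; only `ALLOW = 0` comes from the text above.

```python
import json

# ALLOW = 0 is stated above; the BLOCK number is an assumption for this example.
VERDICT_NUMBERS = {"ALLOW": 0, "BLOCK": 3}


def to_json(message: dict, emit_defaults: bool = False) -> str:
    """Serialize a message, dropping zero-valued enum fields unless asked to keep them."""
    visible = {
        field: name
        for field, name in message.items()
        if emit_defaults or VERDICT_NUMBERS[name] != 0
    }
    return json.dumps(visible)


print(to_json({"verdict": "ALLOW"}))                      # {}
print(to_json({"verdict": "ALLOW"}, emit_defaults=True))  # {"verdict": "ALLOW"}
print(to_json({"verdict": "BLOCK"}))                      # {"verdict": "BLOCK"}
```

An empty response therefore means `ALLOW`, not a failed call. `grpcurl` also offers an `-emit-defaults` flag that prints default values in JSON responses, which may make the verdict visible when testing by hand.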
There is a `/healthz` endpoint used for Kubernetes probes and uptime checks:
```bash
grpcurl -plaintext localhost:8001 spamcheck.SpamcheckService/Healthz
```
## Publishing a new gem version
1. Build new Ruby protobuf files:
```bash
make proto_ruby
```
2. Make sure to update `ruby/spamcheck/version.rb` with the new version number. Keep in mind that `bundle update` uses `--major` by default, so all minor version updates should not break backward or cross-protocol compatibility.
3. Sign up for a [RubyGems account](https://rubygems.org) if you don't already have one. You must also be an owner of the [spamcheck gem](https://rubygems.org/gems/spamcheck).
4. Create a tag for spamcheck with the version number. For example, if the version number is 0.1.0, run:
```bash
git tag v0.1.0
```
5. Check that the tag has been correctly created alongside your latest commit:
```bash
git show v0.1.0
```
6. Push the branch with the tag:
```bash
git push --tags origin <your branch>
```
7. Run `bundle exec ruby _support/publish-gem v0.1.0` locally. It should ask you for your RubyGems email and password.
8. After a successful push, the new gem version should now be publicly available on [RubyGems.org](https://rubygems.org/gems/spamcheck) and ready to use.