# Spamcheck
## Project Structure
Spamcheck started as an internal project by and for GitLab. Over time it became increasingly clear that the community could benefit from these efforts, and the decision was made to make much of it public.
At the moment, management and development (issues, merge requests, etc.) happen in [the gitlab-com Spamcheck project](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck), with [the gitlab-org project](https://gitlab.com/gitlab-org/spamcheck) serving as a code and registry mirror.
## Architecture Diagram
The following diagram gives a high-level overview of how the various components of the anti-spam engine interact with each other:

The basic spamcheck workflow is as follows. If issue metadata meets certain pre-defined conditions then a spam verdict is determined based on custom business logic. If spam status cannot be determined via issue metadata then ML inference is performed on the issue and the confidence ratio of the classification is used to determine the spam verdict.
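
The workflow described above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the service's actual code: the `metadata_verdict` rule, the `allowlisted` key, both thresholds, and every verdict name other than `ALLOW` are hypothetical and chosen only for the example.

```python
from typing import Callable, Optional

# Hypothetical thresholds; the real values are not documented here.
BLOCK_THRESHOLD = 0.9
CONDITIONAL_THRESHOLD = 0.5


def metadata_verdict(issue: dict) -> Optional[str]:
    """Return a verdict from metadata alone, or None if metadata is inconclusive."""
    # Hypothetical rule: an allowlisted author is never treated as spam.
    if issue.get("allowlisted"):
        return "ALLOW"
    return None


def verdict_for(issue: dict, score: Callable[[dict], float]) -> str:
    """Apply metadata rules first, then fall back to ML inference."""
    verdict = metadata_verdict(issue)
    if verdict is not None:
        return verdict
    # Probability in [0.0, 1.0] that the issue is spam.
    confidence = score(issue)
    if confidence >= BLOCK_THRESHOLD:
        return "BLOCK"  # hypothetical verdict name
    if confidence >= CONDITIONAL_THRESHOLD:
        return "CONDITIONAL_ALLOW"  # hypothetical verdict name
    return "ALLOW"
```

Passing the scoring function in as an argument keeps the metadata rules testable without loading a model.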

## Development
### Development Environment
Clone this repo, install dependencies, and run the service. To perform ML inference, ensure a classifier is available in the `./classifiers` directory. See the ML Classifiers and Configuration sections for details.
```bash
git clone git@gitlab.com:gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck.git
cd spamcheck
make deps
cp config/config.example.yml config/config.yml
# Customize config/config.yml if necessary
make run
```
#### Generating gRPC protobuf files
To build the protobuf files when you've made a change:
```bash
make proto
```
To build the Ruby protobuf files:
```bash
make proto_ruby
```
### Running in Docker
Build the `spamcheck` docker image.
```bash
docker build -t spamcheck-py .
```
Run the docker image and ensure model files are mounted.
```bash
docker run --rm -p 8001:8001 -v "$(pwd)/classifiers:/spamcheck/classifiers" spamcheck-py
```
## ML Classifiers
ML classifiers are Python modules that classify spam and return a score between 0.0 and 1.0, which corresponds to the probability that a given issue is spam.
All classifiers should exist as a module associated with the type of data to be classified. For example, code to classify issues should live in the `./classifiers/issue` directory. Each classifier must contain, at a minimum, a `classifier` module that includes a `score` method. The `score` method receives a dictionary representation of the object to classify and returns a float. The directory structure of an issue classifier may look like this:
```
classifiers/
|-- issue/
| |--model/
| | |-- model.tflite
| | |-- tokenizer.pickle
| |--classifier.py
| |--pre_processor.py
```
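As a minimal sketch of the `score` contract described above, a `classifier` module could look like the following. It is a stand-in only: a keyword heuristic replaces real `tflite` inference, and the `SPAM_TERMS` list and the `title` and `description` dictionary keys are assumptions made for the example.

```python
# classifiers/issue/classifier.py (illustrative stand-in, not a trained model)

# Hypothetical keyword list used in place of real model inference.
SPAM_TERMS = ("free money", "click here", "casino")


def score(issue: dict) -> float:
    """Return a spam probability between 0.0 and 1.0 for an issue dictionary."""
    # The "title" and "description" keys are assumed for this example.
    text = f"{issue.get('title', '')} {issue.get('description', '')}".lower()
    hits = sum(term in text for term in SPAM_TERMS)
    # Map the number of matched terms onto the [0.0, 1.0] range.
    return min(1.0, hits / len(SPAM_TERMS))
```

Any module that exposes a `score` function with this shape should satisfy the interface, whatever model sits behind it.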
The ML classifiers are constrained by the dependencies included in this repository's `Pipfile`, and care should be taken not to build classifiers that need additional dependencies. This implies that the model used by the classifier is in the `tflite` format. The `docker build` command will automatically install the latest classifier.
```bash
docker build -t spamcheck-py .
docker run --rm -v "$(pwd)/config:/spamcheck/config" -p8001:8001 spamcheck-py
```
## Configuration
Configuration options can be loaded via a `yaml` config file, environment variables, or CLI arguments. Options are resolved in the following order of precedence, highest first:
1. CLI arguments
2. Environment variables
3. Config file options
4. Application defaults
|Option|Default Value|Description|
|------|--------------|------------|
|`-c`, `--config`<br/>`SPAMCHECK_CONFIG` | `./config/config.yml` | The path to the spamcheck configuration file. Must be in `yaml` format. |
|`--env`<br/>`SPAMCHECK_ENV` | `development`| The environment spamcheck is running in. When running in production, gRPC reflection is disabled. |
|`--grpc-addr`<br/>`SPAMCHECK_GRPC_ADDR` | `0.0.0.0:8001` | The `HOST:PORT` to bind the spamcheck service to. |
|`--log-level`<br/>`SPAMCHECK_LOG_LEVEL` | `info` | The application log level. |
|`--ml-classifiers`<br/>`SPAMCHECK_ML_CLASSIFIERS` | `./classifiers` | Directory location for ML classifiers. |
|`--tls-certificate`<br/>`SPAMCHECK_TLS_CERTIFICATE` | `./ssl/cert.pem` | The path to the TLS certificate to use for secure connections to the spamcheck service. |
|`--tls-private-key`<br/>`SPAMCHECK_TLS_PRIVATE_KEY` | `./ssl/key.pem` | The path to the TLS private key to use for secure connections to the spamcheck service. |
`config/config.yml` has more configuration options. View the [example file](./config/config.example.yml) for details.
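
The precedence order listed above can be illustrated with a small sketch. It is not how spamcheck itself loads configuration: the `resolve` helper is hypothetical, only two of the options are modelled, and the config-file key names (`grpc_addr`, `log_level`) are assumptions.

```python
import argparse

# Application defaults for the two options modelled in this sketch.
DEFAULTS = {"grpc_addr": "0.0.0.0:8001", "log_level": "info"}


def resolve(argv: list, environ: dict, file_options: dict) -> dict:
    """Pick each option from the highest-precedence source that sets it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--grpc-addr")
    parser.add_argument("--log-level")
    cli = vars(parser.parse_args(argv))  # unset flags come back as None

    resolved = {}
    for key, default in DEFAULTS.items():
        env_value = environ.get(f"SPAMCHECK_{key.upper()}")
        # Order: CLI argument, environment variable, config file, default.
        for candidate in (cli.get(key), env_value, file_options.get(key), default):
            if candidate is not None:
                resolved[key] = candidate
                break
    return resolved
```

For example, calling `resolve(["--log-level", "debug"], {"SPAMCHECK_LOG_LEVEL": "warning"}, {})` picks `debug` for the log level, because the CLI argument outranks the environment variable.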
## Concept overview
## Linting
Run pylint against spamcheck source code:
```bash
make lint
```
## Testing
Tests are located in the `./tests` directory and mirror the Python module layout.
### Test suite
Run the test suite:
```bash
make test
```
### Manually
Start the service via `make run`.
Test the gRPC endpoint with a Python gRPC client:
```bash
python client.py
```
Test the gRPC endpoint with [grpcurl](https://github.com/fullstorydev/grpcurl):
```bash
grpcurl -plaintext -d "$(cat examples/checkforspamissue.json)" localhost:8001 spamcheck.SpamcheckService/CheckForSpamIssue
```
By default, `grpcurl` will return an empty object, e.g. `{}`, because Protobufs don't
encode 0-value fields and `SpamVerdict.Verdict` is an enum whose default value is `ALLOW` which is 0.
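When scripting against `grpcurl`, the empty object therefore has to be read as `ALLOW`. A minimal sketch of that interpretation follows; the `verdict_from_grpcurl` helper is hypothetical, and the JSON key `verdict` is an assumption about the response field name.

```python
import json


def verdict_from_grpcurl(output: str) -> str:
    """Read the verdict from grpcurl's JSON output, defaulting to ALLOW."""
    response = json.loads(output)
    # A missing "verdict" key means the enum held its zero value, ALLOW.
    return response.get("verdict", "ALLOW")
```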
There is a `/healthz` endpoint used for Kubernetes probes and uptime checks:
```bash
grpcurl -plaintext localhost:8001 spamcheck.SpamcheckService/Healthz
```
## Publishing a new gem version
1. Build new Ruby protobuf files:
```bash
make proto_ruby
```
2. Update `ruby/spamcheck/version.rb` with the new version number. Keep in mind that `bundle update` uses `--major` by default, so minor version updates should not break backward or cross-protocol compatibility.
3. Sign up for a [RubyGems account](https://rubygems.org) if you don't already have one. You should also be an owner of the [spamcheck gem](https://rubygems.org/gems/spamcheck).
4. Create a tag for spamcheck with the version number. For example, if the version number is 0.1.0, run:
```bash
git tag v0.1.0
```
5. Check that the tag has been correctly created alongside your latest commit:
```bash
git show v0.1.0
```
6. Push the branch with the tag:
```bash
git push --tags origin <your branch>
```
7. Run `bundle exec ruby _support/publish-gem v0.1.0` locally. It should ask you for your RubyGems email and password.
8. After a successful push, the new gem version should now be publicly available on [RubyGems.org](https://rubygems.org/gems/spamcheck) and ready to use.