# Spamcheck

## Project Structure

Spamcheck started as an internal project by and for GitLab. Over time, it became increasingly clear that the community could benefit from these efforts, and the decision was made to make much of it public.
At the moment, management and development (issues, merge requests, etc.) happen in [the gitlab-com Spamcheck project](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck), with [the gitlab-org project](https://gitlab.com/gitlab-org/spamcheck) being a code and registry mirror.

## Architecture Diagram

The following diagram gives a high-level overview of how the various components of the anti-spam engine interact with each other:

![](docs/architecture.drawio.png)

The basic spamcheck workflow is as follows. If issue metadata meets certain predefined conditions, a spam verdict is determined by custom business logic. If spam status cannot be determined from issue metadata, ML inference is performed on the issue, and the confidence ratio of the classification is used to determine the spam verdict.

![](docs/workflow.drawio.png)
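
The two-stage workflow described above can be sketched in Python as follows. This is only an illustration, not the project's actual implementation: the allowlist rule in `metadata_verdict`, the `fake_model_score` stand-in for ML inference, and both threshold values are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical thresholds; the real service defines its own values.
BLOCK_THRESHOLD = 0.9
CONDITIONAL_THRESHOLD = 0.5


def metadata_verdict(issue: dict) -> Optional[str]:
    """Return a verdict from issue metadata alone, or None if undecided.

    The allowlist rule is an invented example of "custom business logic".
    """
    if issue.get("user_in_allowlist"):
        return "ALLOW"
    return None


def verdict_from_confidence(confidence: float) -> str:
    """Map an ML confidence score (0.0 to 1.0) to a verdict name."""
    if confidence >= BLOCK_THRESHOLD:
        return "BLOCK"
    if confidence >= CONDITIONAL_THRESHOLD:
        return "CONDITIONAL_ALLOW"
    return "ALLOW"


def check_issue(issue: dict, score: Callable[[dict], float]) -> str:
    """Try metadata rules first; fall back to ML inference if undecided."""
    verdict = metadata_verdict(issue)
    if verdict is not None:
        return verdict
    return verdict_from_confidence(score(issue))


def fake_model_score(issue: dict) -> float:
    """Stand-in for ML inference so the sketch runs without a model."""
    return 0.95 if "buy now" in issue.get("title", "").lower() else 0.1
```

Passing the scoring function as an argument keeps the metadata rules testable without loading a model.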

## Development

### Running spamcheck with GDK

1. [Log in to the GitLab Container Registry](https://docs.gitlab.com/ee/user/packages/container_registry/authenticate_with_container_registry.html)
    ```bash
    docker login registry.gitlab.com -u <username> -p <token>
    ```
1. Run the service locally with Docker
    ```bash
    docker pull registry.gitlab.com/gitlab-org/gl-security/security-engineering/security-automation/spam/spamcheck
    docker run --rm -p 8001:8001 registry.gitlab.com/gitlab-org/gl-security/security-engineering/security-automation/spam/spamcheck
    ```
1. [Configure recaptcha](https://docs.gitlab.com/ee/development/spam_protection_and_captcha/exploratory_testing.html) in GDK
1. Enable Spamcheck in GDK
    1. Go to Admin -> Settings -> Reporting
    1. Under the Spam and Anti-bot Protection section
        1. Check `Enable reCAPTCHA`
        1. Check `Enable Spam Check via external API endpoint`
        1. Set URL of the external Spam Check endpoint to `grpc://localhost:8001`
1. To change the maximum verdict values of spamcheck use the `--max-[TYPE]-verdict` options
    ```bash
    docker run --rm -p 8001:8001 registry.gitlab.com/gitlab-org/gl-security/security-engineering/security-automation/spam/spamcheck --max-generic-verdict block
    ```
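
One way to picture what a maximum verdict does is as a cap on an ordered scale of verdicts. The sketch below assumes such an ordering. Only `ALLOW = 0` is stated in this README; the other numeric values, the `DISALLOW` member, and the helper names `parse_verdict` and `cap_verdict` are assumptions made for illustration.

```python
from enum import IntEnum


class Verdict(IntEnum):
    # ALLOW = 0 matches this README; the remaining members and their
    # ordering are assumptions for the sake of the example.
    ALLOW = 0
    CONDITIONAL_ALLOW = 1
    DISALLOW = 2
    BLOCK = 3


def parse_verdict(name: str) -> Verdict:
    """Accept CLI-style spellings such as 'block' or 'conditional-allow'."""
    return Verdict[name.upper().replace("-", "_")]


def cap_verdict(verdict: Verdict, maximum: Verdict) -> Verdict:
    """Never return a verdict more severe than the configured maximum."""
    return min(verdict, maximum)
```

Under these assumptions, a maximum of `CONDITIONAL_ALLOW` would downgrade a `BLOCK` verdict to `CONDITIONAL_ALLOW`, while a maximum of `block` would leave every verdict unchanged.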

### Development Environment

Clone this repo, install dependencies, and run the service. To perform ML inference, ensure a classifier is available in the `./classifiers` directory. See the [ML Classifiers](#ml-classifiers) and [Configuration](#configuration) sections for details.

```bash
git clone git@gitlab.com:gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck.git
cd spamcheck
make deps
cp config/config.example.yml config/config.yml
# Customize config/config.yml if necessary
make run
```

#### Generating gRPC protobuf files

To build the protobuf files when you've made a change:

```bash
make proto
```

To build the Ruby protobuf files:

```bash
make proto_ruby
```

### Local development using JupyterLab

As an alternative, we've created a development environment (using a [Jupyter Docker Stacks](https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html) image) that offers a containerized JupyterLab interface in which you can write code and run Python files and Jupyter notebooks.

To set this up, follow the instructions [here](https://gitlab.com/gitlab-com/gl-security/engineering-and-research/automation-team/docker/jupyter/-/blob/main/README.md).

### Running in Docker

Build the `spamcheck` docker image.

```bash
docker build -t spamcheck-py .
```

Run the docker image and ensure model files are mounted.

```bash
docker run --rm -p 8001:8001 -v "$(pwd)/classifiers:/spamcheck/classifiers" spamcheck-py
```

## ML Classifiers

ML classifiers are Python modules that classify spam and return a score between 0.0 and 1.0, which corresponds to the probability that a given issue is spam.

All classifiers should exist as a module associated with the type of data that is to be classified. For example, code to classify issues should live in the `./classifiers/issue` directory. Each classifier must contain, at a minimum, a `classifier` module that includes a `score` method. The `score` method will receive a dictionary representation of the object to classify and return a float. The directory structure of an issue classifier may look like this.

```
classifiers/
|-- issue/
|   |--model/
|   |   |-- model.tflite
|   |   |-- tokenizer.pickle
|   |--classifier.py
|   |--pre_processor.py
```
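
As a rough illustration of the interface described above, a `classifier.py` might expose a module-level `score` function like the one below. This is a sketch under assumptions: the keyword weights stand in for a trained `tflite` model, and the dictionary keys `title` and `description` are hypothetical, since the README does not specify the dictionary's fields.

```python
"""Hypothetical classifiers/issue/classifier.py exposing a `score` function."""

# Invented keyword weights standing in for a trained model.
SPAM_TERMS = {"free": 0.4, "casino": 0.5, "click here": 0.4}


def score(issue: dict) -> float:
    """Return a spam probability between 0.0 and 1.0 for an issue dictionary."""
    # The field names are assumptions; missing fields are treated as empty text.
    text = f"{issue.get('title', '')} {issue.get('description', '')}".lower()
    total = sum((weight for term, weight in SPAM_TERMS.items() if term in text), 0.0)
    # Clamp so the result always stays within the documented range.
    return min(1.0, total)
```

A real classifier would load its model once at import time and run inference inside `score`, keeping the same dictionary-in, float-out contract.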

The ML classifiers are constrained by the dependencies included in this repository's `Pipfile`, and care should be taken not to build classifiers that need additional dependencies. This implies that the model used by the classifier is in the `tflite` format. The `docker build` command will automatically install the latest classifier.

```bash
docker build -t spamcheck-py .
docker run --rm -v "$(pwd)/config:/spamcheck/config" -p8001:8001 spamcheck-py
```


## Configuration

Configuration options can be loaded via a `yaml` config file, environment variables, or CLI arguments. The order of precedence, from highest to lowest, is:

1.  CLI arguments
2.  Environment variables
3.  Config file options
4.  Application defaults

| Option | Default Value | Description |
|--------|---------------|-------------|
| `-c`, `--config`<br/>`SPAMCHECK_CONFIG` | `./config/config.yml` | The path to the spamcheck configuration file. Must be in `yaml` format. |
| `--env`<br/>`SPAMCHECK_ENV` | `development` | The environment spamcheck is running in. When running in production gRPC reflection is disabled. |
| `--gcs-bucket`<br/>`SPAMCHECK_GCS_BUCKET` | `None` | The GCS bucket to store unlabeled spam for future labeling and training. |
| `--google-pubsub-project`<br/>`SPAMCHECK_GOOGLE_PUBSUB_PROJECT` | `None` | The GCP project where the spamcheck PubSub topic resides for publishing spam events. |
| `--google-pubsub-topic`<br/>`SPAMCHECK_GOOGLE_PUBSUB_TOPIC` | `spamcheck` | The GCP PubSub topic to publish spam events to. |
| `--grpc-addr`<br/>`SPAMCHECK_GRPC_ADDR` | `0.0.0.0:8001` | The `HOST:PORT` to bind the spamcheck service to. |
| `--log-level`<br/>`SPAMCHECK_LOG_LEVEL` | `info` | The application log level. |
| `--max-generic-verdict`<br/>`SPAMCHECK_MAX_GENERIC_VERDICT` | `ALLOW` | Maximum verdict to return for generic spammables. |
| `--max-issue-verdict`<br/>`SPAMCHECK_MAX_ISSUE_VERDICT` | `CONDITIONAL_ALLOW` | Maximum verdict to return for issue spammables. |
| `--max-snippet-verdict`<br/>`SPAMCHECK_MAX_SNIPPET_VERDICT` | `CONDITIONAL_ALLOW` | Maximum verdict to return for snippet spammables. |
| `--ml-classifiers`<br/>`SPAMCHECK_ML_CLASSIFIERS` | `./classifiers` | Directory location for ML classifiers. |
| `--tls-certificate`<br/>`SPAMCHECK_TLS_CERTIFICATE` | `./ssl/cert.pem` | The path to the TLS certificate to use for secure connections to spamcheck service. |
| `--tls-private-key`<br/>`SPAMCHECK_TLS_PRIVATE_KEY` | `./ssl/key.pem` | The path to the TLS private key to use for secure connections to spamcheck service. |

`config/config.yml` has more configuration options. View the [example file](./config/config.example.yml) for details.
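
The precedence between CLI arguments, environment variables, config file options, and application defaults can be sketched with Python's `collections.ChainMap`. This is not how spamcheck itself resolves options; `resolve_config`, its parameters, and the lower-cased key names are hypothetical, while the default values are taken from the table above.

```python
from collections import ChainMap

# Defaults copied from the options table; the key names are an assumption.
DEFAULTS = {"grpc_addr": "0.0.0.0:8001", "log_level": "info", "env": "development"}
ENV_PREFIX = "SPAMCHECK_"


def resolve_config(cli: dict, environ: dict, file_options: dict) -> dict:
    """Merge option sources so that earlier sources win over later ones."""
    # Keep only SPAMCHECK_* variables and normalise them to option keys.
    env_options = {
        key[len(ENV_PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(ENV_PREFIX)
    }
    # Drop unset CLI options (None) so they do not mask lower-priority sources.
    cli_options = {key: value for key, value in cli.items() if value is not None}
    return dict(ChainMap(cli_options, env_options, file_options, DEFAULTS))
```

For example, with `--log-level debug` on the command line and `SPAMCHECK_LOG_LEVEL=warning` in the environment, the resolved `log_level` in this sketch would be `debug`.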

## Concept overview

## Linting

Run pylint against spamcheck source code:

```bash
make lint
```

## Testing

Tests are located in the `./tests` directory and mirror the Python module layout.

### Test suite

Run the test suite:

```bash
make test
```

### Manually

Start the service via `make run`.

Test the gRPC endpoint with a Python gRPC client:

```bash
python client.py
```

Test the gRPC endpoint with [grpcurl](https://github.com/fullstorydev/grpcurl):

```bash
grpcurl -plaintext -d "$(cat examples/checkforspamissue.json)" localhost:8001 spamcheck.SpamcheckService/CheckForSpamIssue
```

By default, `grpcurl` will return an empty object, e.g. `{}`, because Protobufs don't
encode 0-value fields and `SpamVerdict.Verdict` is an enum whose default value is `ALLOW` which is 0.
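
To make the empty `{}` less surprising, the snippet below imitates this behaviour in plain Python. It does not use the protobuf library; `render` and the `VERDICT_NAMES` mapping are hypothetical, and only `ALLOW = 0` is taken from the text above (the value `1` for `CONDITIONAL_ALLOW` is an assumption).

```python
import json

# ALLOW = 0 comes from the README; the value 1 is an assumption.
VERDICT_NAMES = {0: "ALLOW", 1: "CONDITIONAL_ALLOW"}


def render(message: dict) -> str:
    """Imitate JSON output that leaves out fields holding their zero value."""
    visible = {}
    for field, value in message.items():
        if value in (0, "", False):
            continue  # zero-value fields are omitted, as described above
        # Enum fields are shown by name rather than by number.
        visible[field] = VERDICT_NAMES.get(value, value) if field == "verdict" else value
    return json.dumps(visible)
```

In this imitation, a response whose verdict is `ALLOW` renders as an empty object, so an empty reply from `grpcurl` should be read as `ALLOW` rather than as a missing result.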

There is a `/healthz` endpoint used for Kubernetes probes and uptime checks:

```bash
grpcurl -plaintext localhost:8001 spamcheck.SpamcheckService/Healthz
```

## Publishing a new gem version

1.  Build new Ruby protobuf files:
    ```bash
    make proto_ruby
    ```

2.  Make sure to update `ruby/spamcheck/version.rb` with the new version number. Keep in mind that `bundle update` uses `--major` by default, so all minor version updates should not break backward or cross-protocol compatibility.

3.  Sign up for a [RubyGems account](https://rubygems.org) if you don't already have one. You should also be an owner of the [spamcheck gem](https://rubygems.org/gems/spamcheck).

4.  Create a tag for spamcheck with the version number. For example, if the version number is 0.1.0, run:

    ```bash
    git tag v0.1.0
    ```

5.  Check that the tag has been correctly created alongside your latest commit:

    ```bash
    git show v0.1.0
    ```

6.  Push the branch with the tag:

    ```bash
    git push --tags origin <your branch>
    ```

7.  Run `bundle exec ruby _support/publish-gem v0.1.0` locally. It should ask you for your RubyGems email and password.

8.  After a successful push, the new gem version should now be publicly available on [RubyGems.org](https://rubygems.org/gems/spamcheck) and ready to use.