1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282
|
.. _related_projects:
=====================================
Related Projects
=====================================
Projects implementing the scikit-learn estimator API are encouraged to use
the `scikit-learn-contrib template <https://github.com/scikit-learn-contrib/project-template>`_
which facilitates best practices for testing and documenting estimators.
The `scikit-learn-contrib GitHub organisation <https://github.com/scikit-learn-contrib/scikit-learn-contrib>`_
also accepts high-quality contributions of repositories conforming to this
template.
Below is a list of sister-projects, extensions and domain specific packages.
Interoperability and framework enhancements
-------------------------------------------
These tools adapt scikit-learn for use with other technologies or otherwise
enhance the functionality of scikit-learn's estimators.
**Data formats**
- `sklearn_pandas <https://github.com/paulgb/sklearn-pandas/>`_ bridge for
scikit-learn pipelines and pandas data frame with dedicated transformers.
- `sklearn_xarray <https://github.com/phausamann/sklearn-xarray/>`_ provides
compatibility of scikit-learn estimators with xarray data structures.
**Auto-ML**
- `auto_ml <https://github.com/ClimbsRocks/auto_ml/>`_
Automated machine learning for production and analytics, built on scikit-learn
and related projects. Trains a pipeline wth all the standard machine learning
steps. Tuned for prediction speed and ease of transfer to production environments.
- `auto-sklearn <https://github.com/automl/auto-sklearn/>`_
An automated machine learning toolkit and a drop-in replacement for a
scikit-learn estimator
- `TPOT <https://github.com/rhiever/tpot>`_
An automated machine learning toolkit that optimizes a series of scikit-learn
operators to design a machine learning pipeline, including data and feature
preprocessors as well as the estimators. Works as a drop-in replacement for a
scikit-learn estimator.
- `scikit-optimize <https://scikit-optimize.github.io/>`_
A library to minimize (very) expensive and noisy black-box functions. It
implements several methods for sequential model-based optimization, and
includes a replacement for ``GridSearchCV`` or ``RandomizedSearchCV`` to do
cross-validated parameter search using any of these strategies.
**Experimentation frameworks**
- `REP <https://github.com/yandex/REP>`_ Environment for conducting data-driven
research in a consistent and reproducible way
- `ML Frontend <https://github.com/jeff1evesque/machine-learning>`_ provides
dataset management and SVM fitting/prediction through
`web-based <https://github.com/jeff1evesque/machine-learning#web-interface>`_
and `programmatic <https://github.com/jeff1evesque/machine-learning#programmatic-interface>`_
interfaces.
- `Scikit-Learn Laboratory
<https://skll.readthedocs.io/en/latest/index.html>`_ A command-line
wrapper around scikit-learn that makes it easy to run machine learning
experiments with multiple learners and large feature sets.
- `Xcessiv <https://github.com/reiinakano/xcessiv>`_ is a notebook-like
application for quick, scalable, and automated hyperparameter tuning
and stacked ensembling. Provides a framework for keeping track of
model-hyperparameter combinations.
**Model inspection and visualisation**
- `eli5 <https://github.com/TeamHG-Memex/eli5/>`_ A library for
debugging/inspecting machine learning models and explaining their
predictions.
- `mlxtend <https://github.com/rasbt/mlxtend>`_ Includes model visualization
utilities.
- `scikit-plot <https://github.com/reiinakano/scikit-plot>`_ A visualization library
for quick and easy generation of common plots in data analysis and machine learning.
- `yellowbrick <https://github.com/DistrictDataLabs/yellowbrick>`_ A suite of
custom matplotlib visualizers for scikit-learn estimators to support visual feature
analysis, model selection, evaluation, and diagnostics.
**Model export for production**
- `sklearn-pmml <https://github.com/alex-pirozhenko/sklearn-pmml>`_
Serialization of (some) scikit-learn estimators into PMML.
- `sklearn2pmml <https://github.com/jpmml/sklearn2pmml>`_
Serialization of a wide variety of scikit-learn estimators and transformers
into PMML with the help of `JPMML-SkLearn <https://github.com/jpmml/jpmml-sklearn>`_
library.
- `sklearn-porter <https://github.com/nok/sklearn-porter>`_
Transpile trained scikit-learn models to C, Java, Javascript and others.
- `sklearn-compiledtrees <https://github.com/ajtulloch/sklearn-compiledtrees/>`_
Generate a C++ implementation of the predict function for decision trees (and
ensembles) trained by sklearn. Useful for latency-sensitive production
environments.
Other estimators and tasks
--------------------------
Not everything belongs or is mature enough for the central scikit-learn
project. The following are projects providing interfaces similar to
scikit-learn for additional learning algorithms, infrastructures
and tasks.
**Structured learning**
- `Seqlearn <https://github.com/larsmans/seqlearn>`_ Sequence classification
using HMMs or structured perceptron.
- `HMMLearn <https://github.com/hmmlearn/hmmlearn>`_ Implementation of hidden
markov models that was previously part of scikit-learn.
- `PyStruct <https://pystruct.github.io>`_ General conditional random fields
and structured prediction.
- `pomegranate <https://github.com/jmschrei/pomegranate>`_ Probabilistic modelling
for Python, with an emphasis on hidden Markov models.
- `sklearn-crfsuite <https://github.com/TeamHG-Memex/sklearn-crfsuite>`_
Linear-chain conditional random fields
(`CRFsuite <http://www.chokkan.org/software/crfsuite/>`_ wrapper with
sklearn-like API).
**Deep neural networks etc.**
- `pylearn2 <http://deeplearning.net/software/pylearn2/>`_ A deep learning and
neural network library build on theano with scikit-learn like interface.
- `sklearn_theano <https://sklearn-theano.github.io/>`_ scikit-learn compatible
estimators, transformers, and datasets which use Theano internally
- `nolearn <https://github.com/dnouri/nolearn>`_ A number of wrappers and
abstractions around existing neural network libraries
- `keras <https://github.com/fchollet/keras>`_ Deep Learning library capable of
running on top of either TensorFlow or Theano.
- `lasagne <https://github.com/Lasagne/Lasagne>`_ A lightweight library to
build and train neural networks in Theano.
- `skorch <https://github.com/dnouri/skorch>`_ A scikit-learn compatible
neural network library that wraps PyTorch.
**Broad scope**
- `mlxtend <https://github.com/rasbt/mlxtend>`_ Includes a number of additional
estimators as well as model visualization utilities.
- `sparkit-learn <https://github.com/lensacom/sparkit-learn>`_ Scikit-learn
API and functionality for PySpark's distributed modelling.
**Other regression and classification**
- `xgboost <https://github.com/dmlc/xgboost>`_ Optimised gradient boosted decision
tree library.
- `ML-Ensemble <https://mlens.readthedocs.io/>`_ Generalized
ensemble learning (stacking, blending, subsemble, deep ensembles,
etc.).
- `lightning <https://github.com/scikit-learn-contrib/lightning>`_ Fast
state-of-the-art linear model solvers (SDCA, AdaGrad, SVRG, SAG, etc...).
- `py-earth <https://github.com/scikit-learn-contrib/py-earth>`_ Multivariate
adaptive regression splines
- `Kernel Regression <https://github.com/jmetzen/kernel_regression>`_
Implementation of Nadaraya-Watson kernel regression with automatic bandwidth
selection
- `gplearn <https://github.com/trevorstephens/gplearn>`_ Genetic Programming
for symbolic regression tasks.
- `multiisotonic <https://github.com/alexfields/multiisotonic>`_ Isotonic
regression on multidimensional features.
- `scikit-multilearn <https://scikit.ml>`_ Multi-label classification with
focus on label space manipulation.
- `seglearn <https://github.com/dmbee/seglearn>`_ Time series and sequence
learning using sliding window segmentation.
**Decomposition and clustering**
- `lda <https://github.com/ariddell/lda/>`_: Fast implementation of latent
Dirichlet allocation in Cython which uses `Gibbs sampling
<https://en.wikipedia.org/wiki/Gibbs_sampling>`_ to sample from the true
posterior distribution. (scikit-learn's
:class:`sklearn.decomposition.LatentDirichletAllocation` implementation uses
`variational inference
<https://en.wikipedia.org/wiki/Variational_Bayesian_methods>`_ to sample from
a tractable approximation of a topic model's posterior distribution.)
- `Sparse Filtering <https://github.com/jmetzen/sparse-filtering>`_
Unsupervised feature learning based on sparse-filtering
- `kmodes <https://github.com/nicodv/kmodes>`_ k-modes clustering algorithm for
categorical data, and several of its variations.
- `hdbscan <https://github.com/scikit-learn-contrib/hdbscan>`_ HDBSCAN and Robust Single
Linkage clustering algorithms for robust variable density clustering.
- `spherecluster <https://github.com/clara-labs/spherecluster>`_ Spherical
K-means and mixture of von Mises Fisher clustering routines for data on the
unit hypersphere.
**Pre-processing**
- `categorical-encoding
<https://github.com/scikit-learn-contrib/categorical-encoding>`_ A
library of sklearn compatible categorical variable encoders.
- `imbalanced-learn
<https://github.com/scikit-learn-contrib/imbalanced-learn>`_ Various
methods to under- and over-sample datasets.
Statistical learning with Python
--------------------------------
Other packages useful for data analysis and machine learning.
- `Pandas <https://pandas.pydata.org/>`_ Tools for working with heterogeneous and
columnar data, relational queries, time series and basic statistics.
- `theano <http://deeplearning.net/software/theano/>`_ A CPU/GPU array
processing framework geared towards deep learning research.
- `statsmodels <https://www.statsmodels.org>`_ Estimating and analysing
statistical models. More focused on statistical tests and less on prediction
than scikit-learn.
- `PyMC <https://pymc-devs.github.io/pymc/>`_ Bayesian statistical models and
fitting algorithms.
- `Sacred <https://github.com/IDSIA/Sacred>`_ Tool to help you configure,
organize, log and reproduce experiments
- `Seaborn <http://stanford.edu/~mwaskom/software/seaborn/>`_ Visualization library based on
matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- `Deep Learning <http://deeplearning.net/software_links/>`_ A curated list of deep learning
software libraries.
Domain specific packages
~~~~~~~~~~~~~~~~~~~~~~~~
- `scikit-image <https://scikit-image.org/>`_ Image processing and computer
vision in python.
- `Natural language toolkit (nltk) <https://www.nltk.org/>`_ Natural language
processing and some machine learning.
- `gensim <https://radimrehurek.com/gensim/>`_ A library for topic modelling,
document indexing and similarity retrieval
- `NiLearn <https://nilearn.github.io/>`_ Machine learning for neuro-imaging.
- `AstroML <https://www.astroml.org/>`_ Machine learning for astronomy.
- `MSMBuilder <http://msmbuilder.org/>`_ Machine learning for protein
conformational dynamics time series.
- `scikit-surprise <https://surpriselib.com/>`_ A scikit for building and
evaluating recommender systems.
Snippets and tidbits
---------------------
The `wiki <https://github.com/scikit-learn/scikit-learn/wiki/Third-party-projects-and-code-snippets>`_ has more!
|