.. _faq:

===========================
Frequently Asked Questions
===========================

Here we try to give some answers to questions that regularly pop up on the mailing list.

What is the project name (a lot of people get it wrong)?
--------------------------------------------------------
scikit-learn, but not scikit, SciKit, or sci-kit learn.
Also not scikits.learn or scikits-learn, which were previously used.

How do you pronounce the project name?
------------------------------------------
sy-kit learn. sci stands for science!

Why scikit?
------------
There are multiple scikits, which are scientific toolboxes built around SciPy.
You can find a list at `<https://scikits.appspot.com/scikits>`_.
Apart from scikit-learn, another popular one is `scikit-image <http://scikit-image.org/>`_.

How can I contribute to scikit-learn?
-----------------------------------------
See :ref:`contributing`. Adding a new algorithm is usually a major and lengthy
undertaking, so it is recommended to start with :ref:`known
issues <easy_issues>`.

What's the best way to get help on scikit-learn usage?
--------------------------------------------------------------
**For general machine learning questions**, please use
`Cross Validated <http://stats.stackexchange.com>`_ with the ``[machine-learning]`` tag.

**For scikit-learn usage questions**, please use `Stack Overflow <http://stackoverflow.com/questions/tagged/scikit-learn>`_
with the ``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
<https://mail.python.org/mailman/listinfo/scikit-learn>`_.

Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
``sklearn.datasets`` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.

The problem should be reproducible by simply copy-pasting your code snippet in a Python
shell with scikit-learn installed. Do not forget to include the import statements.
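
For example, a self-contained reproduction script for a hypothetical problem
could look like the following sketch (the dataset and estimator below are
placeholders, not taken from any real report)::

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(42)  # fixed seed for reproducibility
    X = rng.rand(20, 3)              # toy data: 20 samples, 3 features
    y = rng.randint(0, 2, size=20)   # toy binary labels

    clf = SVC(kernel='linear')
    clf.fit(X, y)  # <-- describe the unexpected behavior observed here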

More guidance to write good reproduction code snippets can be found at:

http://stackoverflow.com/help/mcve

If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.

For bug reports or feature requests, please make use of the
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.

There is also a `scikit-learn Gitter channel
<https://gitter.im/scikit-learn/scikit-learn>`_ where some users and developers
might be found. 

**Please do not email any authors directly to ask for assistance, report bugs,
or for any other issue related to scikit-learn.**

How can I create a bunch object?
------------------------------------------------

Don't make a bunch object! They are not part of the scikit-learn API. Bunch
objects are just a way to package some numpy arrays. As a scikit-learn user you
only ever need numpy arrays to feed your model with data.

For instance to train a classifier, all you need is a 2D array ``X`` for the
input variables and a 1D array ``y`` for the target variables. The array ``X``
holds the features as columns and samples as rows. The array ``y`` contains
integer values to encode the class membership of each sample in ``X``.
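
As a minimal sketch (with a randomly generated toy dataset and an arbitrary
choice of classifier, purely for illustration)::

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = rng.rand(100, 4)             # 100 samples, 4 features
    y = (X[:, 0] > 0.5).astype(int)  # one integer class label per sample

    clf = LogisticRegression()
    clf.fit(X, y)
    print(clf.predict(X[:5]))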

To load data as numpy arrays you can use different libraries depending on the
original data format:

* `numpy.loadtxt
  <http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html>`_ to
  load text files (such as CSV) assuming that all the columns have a
  homogeneous data type (e.g. all numeric values); see the sketch after this
  list.

* `scipy.io <http://docs.scipy.org/doc/scipy/reference/io.html>`_ for common
  binary formats often used in scientific computing context.

* `scipy.misc.imread
  <http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.imread.html#scipy.misc.imread>`_
  (requires the `Pillow <https://pypi.python.org/pypi/Pillow>`_ package) to
  load pixel intensities data from various image file formats.

* `pandas.io <http://pandas.pydata.org/pandas-docs/stable/io.html>`_ to load
  heterogeneously typed data from various file formats and database protocols,
  with tools to slice and dice the data before conversion to numerical
  features in a numpy array.
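
For instance, a minimal sketch of loading a numeric CSV file with
``numpy.loadtxt`` (assuming a hypothetical ``data.csv`` whose last column
holds integer class labels)::

    import numpy as np

    # every column must be numeric for loadtxt to parse the file
    data = np.loadtxt('data.csv', delimiter=',')
    X, y = data[:, :-1], data[:, -1].astype(int)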

Note: if you manage your own numerical data, it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.
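
For illustration only, reading and writing an array with the ``h5py`` package
(assuming it is installed) could look like::

    import numpy as np
    import h5py

    X = np.random.RandomState(0).rand(1000, 20)

    # write the array once...
    with h5py.File('data.h5', 'w') as f:
        f.create_dataset('X', data=X)

    # ...then reload it quickly in later runs
    with h5py.File('data.h5', 'r') as f:
        X = f['X'][:]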

What are the inclusion criteria for new algorithms?
----------------------------------------------------

We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations and wide use and
usefulness. A technique that provides a clear-cut improvement (e.g. an
enhanced data structure or a more efficient approximation technique) on
a widely-used method will also be considered for inclusion.

From the algorithms or techniques that meet the above criteria, only those
that fit well within the current API of scikit-learn, that is, a ``fit``,
``predict``/``transform`` interface with input/output that is ordinarily a
numpy array or sparse matrix, are accepted.

The contributor should support the importance of the proposed addition with
research papers and/or implementations in other similar packages, demonstrate
its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. The proposed algorithm
should outperform the methods already implemented in scikit-learn at least
in some areas.

Also note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm in
a scikit-learn compatible way, upload it to GitHub and let us know. We will
list it under :ref:`related_projects`.
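
As an illustrative sketch only (this toy class is not part of scikit-learn),
a compatible estimator mainly needs to subclass ``BaseEstimator`` and
implement ``fit`` and ``predict``::

    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin

    class MajorityClassifier(BaseEstimator, ClassifierMixin):
        """Toy classifier that always predicts the most frequent class."""

        def fit(self, X, y):
            # remember the majority class seen during training
            values, counts = np.unique(y, return_counts=True)
            self.majority_ = values[np.argmax(counts)]
            return self

        def predict(self, X):
            # predict the stored majority class for every sample
            return np.full(len(X), self.majority_)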

.. _selectiveness:

Why are you so selective on what algorithms you include in scikit-learn?
------------------------------------------------------------------------
Code is maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales non-linearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
Also see `this thread on the mailing list
<https://sourceforge.net/p/scikit-learn/mailman/scikit-learn-general/thread/CAAkaFLWcBG+gtsFQzpTLfZoCsHMDv9UG5WaqT0LwUApte0TVzg@mail.gmail.com/#msg33104380>`_.

Why did you remove HMMs from scikit-learn?
--------------------------------------------
See :ref:`adding_graphical_models`.

.. _adding_graphical_models:

Will you add graphical models or sequence prediction to scikit-learn?
---------------------------------------------------------------------

Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.

There are two projects with APIs similar to scikit-learn that
do structured prediction:

* `pystruct <http://pystruct.github.io/>`_ handles general structured
  learning (focuses on SSVMs on arbitrary graph structures with
  approximate inference; defines the notion of sample as an instance of
  the graph structure)

* `seqlearn <http://larsmans.github.io/seqlearn/>`_ handles sequences only
  (focuses on exact inference; has HMMs, but mostly for the sake of
  completeness; treats a feature vector as a sample and uses an offset encoding
  for the dependencies between feature vectors)

Will you add GPU support?
-------------------------

No, or at least not in the near future. The main reason is that GPU support
would introduce many software dependencies and platform-specific
issues. scikit-learn is designed to be easy to install on a wide variety of
platforms. Outside of neural networks, GPUs don't play a large role in machine
learning today, and much larger gains in speed can often be achieved by a
careful choice of algorithms.

Do you support PyPy?
--------------------

In case you didn't know, `PyPy <http://pypy.org/>`_ is the new, fast,
just-in-time compiling Python implementation. We don't support it.
When the `NumPy support <http://buildbot.pypy.org/numpy-status/latest.html>`_
in PyPy is complete or near-complete, and SciPy is ported over as well,
we can start thinking of a port.
We use too much of NumPy to work with a partial implementation.

How do I deal with string data (or trees, graphs...)?
-----------------------------------------------------

scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.

If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.
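
For instance, a minimal sketch with the built-in ``TfidfVectorizer`` (the
documents below are placeholders)::

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse matrix of tf-idf features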

Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (a.k.a. Levenshtein distance; e.g., DNA or RNA sequences). These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.

Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. E.g., to use
DBSCAN with Levenshtein distances::

    >>> from leven import levenshtein       # doctest: +SKIP
    >>> import numpy as np
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
    ...     i, j = int(x[0]), int(y[0])     # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
    >>> X
    array([[0],
           [1],
           [2]])
    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2)  # doctest: +SKIP
    ([0, 1], array([ 0,  0, -1]))

(This uses the third-party edit distance package ``leven``.)

Similar tricks can be used, with some care, for tree kernels, graph kernels,
etc.

Why do I sometimes get a crash/freeze with n_jobs > 1 under OSX or Linux?
--------------------------------------------------------------------------

Several scikit-learn tools such as ``GridSearchCV`` and ``cross_val_score``
rely internally on Python's `multiprocessing` module to parallelize execution
onto several Python processes by passing ``n_jobs > 1`` as argument.

The problem is that Python ``multiprocessing`` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries like (some versions of) Accelerate / vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, NVIDIA's CUDA (and probably many others),
manage their own internal thread pool. Upon a call to `fork`, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
master since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).

But in the end the real culprit is Python's ``multiprocessing`` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software vendors like Apple refuse to
consider the lack of fork-safety in Accelerate / vecLib as a bug.

In Python 3.4+ it is now possible to configure ``multiprocessing`` to use the
'forkserver' or 'spawn' start methods (instead of the default 'fork') to manage
the process pools. Using either of these start methods avoids this issue
entirely.

If you have custom code that uses ``multiprocessing`` directly instead of using
it via joblib, you can enable the 'forkserver' mode globally for your
program by inserting the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

    if __name__ == '__main__':
        multiprocessing.set_start_method('forkserver')

        # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the `multiprocessing
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.

Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
--------------------------------------------------------------------------------------------------------------------------------------

Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fit within
the design constraints of scikit-learn; as a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.

Why is my pull request not getting any attention?
-------------------------------------------------

The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later change come at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.

Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely because of
this reason.

How do I set a ``random_state`` for an entire execution?
---------------------------------------------------------

For testing and replicability, it is often important to have the entire execution
controlled by a single seed for the pseudo-random number generator used in
algorithms that have a randomized component. Scikit-learn does not use its own
global random state; whenever a RandomState instance or an integer random seed
is not provided as an argument, it relies on the numpy global random state,
which can be set using :func:`numpy.random.seed`.
For example, to set an execution's numpy global random state to 42, one could
execute the following at the beginning of their script::

    import numpy as np
    np.random.seed(42)

However, a global random state is prone to modification by other code during
execution. Thus, the only way to ensure replicability is to pass ``RandomState``
instances everywhere and ensure that both estimators and cross-validation
splitters have their ``random_state`` parameter set.
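
For instance, a sketch of this pattern (the estimator and splitter below are
arbitrary examples)::

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_classification(n_samples=100, random_state=0)

    rng = np.random.RandomState(42)  # a single source of randomness
    clf = RandomForestClassifier(random_state=rng)
    cv = KFold(n_splits=5, shuffle=True, random_state=rng)
    scores = cross_val_score(clf, X, y, cv=cv)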