File: README_.md

package info (click to toggle)
python-hdmedians 0.14.2-6
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 660 kB
sloc: python: 327; makefile: 32; sh: 12
file content (130 lines) | stat: -rw-r--r-- 5,208 bytes
parent folder | download | duplicates (4)
# Hdmedians

Did you know there is no unique way to mathematically extend the concept of a
[median](https://en.wikipedia.org/wiki/Median) to higher dimensions?

Various definitions for a **high-dimensional median** exist and this Python
package provides a number of fast implementations of these definitions.
Medians are extremely useful due to their high breakdown point (up to 50%
contamination) and have a number of nice applications in machine learning,
computer vision, and high-dimensional statistics.

<p align="center">
<img src="https://rawgit.com/daleroberts/hdmedians/master/docs/fig1.svg" width=600px height=180px />
</p>

This package currently has implementations of [medoid](#medoid) and [geometric
median](#geometric-median) with support for missing data using `NaN`. 

### Installation

The latest version of the package is always available on [pypi](https://pypi.python.org/pypi/hdmedians), 
so can be easily installed by typing:
```{sh}
pip3 install hdmedians
```

## Medoid

Given a finite set $\mathbb{X}$ of $p$-dimensional observation vectors $\mathbb{X}=\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, 
the [medoid](https://en.wikipedia.org/wiki/Medoid) $\mathbf{m}$ of these observations is given by
$$
  \mathbf{m} := \operatorname{argmin}_{\mathbf{x} \in \mathbb{X}} \sum_{i=1}^n \|\mathbf{x} - \mathbf{x}_i\|.
$$

The current implementation of `medoid` is in vectorized Python and can handle
any data type supported by
[ndarray](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html).
If you would like the algorithm to take care of missing values encoded as `nan`
then you can use the `nanmedoid` function.

### Examples

Create an 6 x 10 array of random integer observations.
```{python}
>>> import numpy as np
>>> X = np.random.randint(100, size=(6, 10))
array([[12,  9, 61, 76,  2, 17, 12, 11, 26,  0],
       [65, 72,  7, 64, 21, 92, 51, 48,  9, 65],
       [39,  7, 50, 56, 29, 79, 47, 45, 10, 52],
       [70, 12, 23, 97, 86, 14, 42, 90, 15, 16],
       [13,  7,  2, 47, 80, 53, 23, 59,  7, 15],
       [83,  2, 40, 12, 22, 75, 69, 61, 28, 53]])
```

Find the medoid, taking the last axis as the number of observations.
```{python}
>>> import hdmedians as hd
>>> hd.medoid(X)
array([12, 51, 47, 42, 23, 69])
```

Take the first axis as the number of observations.
```{python}
>>> hd.medoid(X, axis=0)
array([39,  7, 50, 56, 29, 79, 47, 45, 10, 52])
```

Since the medoid is one of the observations, the `medoid` function has the ability to only return the index if required.
```{python}
>>> hd.medoid(X, indexonly=True)
6
>>> X[:,6]
array([12, 51, 47, 42, 23, 69])
```

## Geometric Median

The [geometric median](https://en.wikipedia.org/wiki/Geometric_median) is also known as the 1-median, spatial median,
Euclidean minisum, or Torricelli point. Given a finite set $\mathbb{X}$ of $p$-dimensional observation vectors $\mathbb{X}=\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, 
the geometric median $\hat{\mu}$ of these observations is given by
$$
  \hat \mu := \operatorname{argmin}_{\mathbf{x} \in \mathbb{R}^p} \sum_{i=1}^n \|\mathbf{x} - \mathbf{x}_i\|.
$$
Note there is a subtle difference between the definition of the geometric median and the medoid: the search space 
for the solution differs and has the effect that the medoid returns one of the true observations whereas the geometric median can be described 
as a synthetic (not physically observed) observation.

The current implementation of `geomedian` uses Cython and can handle `float64`
or `float32`. If you would like the algorithm to take care of missing values
encoded as `nan` then you can use the `nangeomedian` function.

### Examples

Create an 6 x 10 array of random `float64` observations.
```{python}
>>> import numpy as np
>>> np.set_printoptions(precision=4, linewidth=200)
>>> X = np.random.normal(1, size=(6, 10))
array([[ 1.1079,  0.5763,  0.3072,  1.2205,  0.8596, -1.5082,  2.5955,  2.8251,  1.5908,  0.4575],
       [ 1.555 ,  1.7903,  1.213 ,  1.1285,  0.0461, -0.4929, -0.1158,  0.5879,  1.5807,  0.5828],
       [ 2.1583,  3.4429,  0.4166,  1.0192,  0.8308, -0.1468,  2.6329,  2.2239,  0.2168,  0.8783],
       [ 0.7382,  1.9453,  0.567 ,  0.6797,  1.1654, -0.1556,  0.9934,  0.1857,  1.369 ,  2.1855],
       [ 0.1727,  0.0835,  0.5416,  1.4416,  1.6921,  1.6636,  1.6421,  1.0687,  0.6075, -0.0301],
       [ 2.6654,  1.6741,  1.1568,  1.3092,  1.6944,  0.2574,  2.8604,  1.6102,  0.4301, -0.3876]])
>>> X.dtype
dtype('float64')
```

Find the geometric median, taking the last axis as the number of observations.
```{python}
>>> import hdmedians as hd
>>> np.array(hd.geomedian(X))
array([ 1.0733,  0.8974,  1.1935,  0.9122,  0.9975,  1.3422])
```

Take the first axis as the number of observations.
```{python}
>>> np.array(hd.geomedian(X, axis=0))
array([ 1.4581,  1.6377,  0.7147,  1.1257,  1.0493, -0.091 ,  1.7907,  1.4168,  0.9587,  0.6195])
```

Convert to `float32` and compute the geometric median.
```{python}
>>> X = X.astype(np.float32)
>>> m = hd.geomedian(X)
```

## References

  * Small, C. G. (1990). [A survey of multidimensional medians](http://www.jstor.org/stable/1403809). *International Statistical Review/Revue Internationale de Statistique*, 263-277.