
{{alias}}( k[, ndims][, options] )
    Returns an accumulator function which incrementally partitions data into `k`
    clusters.

    If not provided initial centroids, the accumulator caches data vectors for
    subsequent centroid initialization. Until the accumulator has computed
    initial centroids, it returns `null`.

    Once an accumulator has initial centroids (either provided or computed), if
    provided a data vector, the accumulator function returns updated cluster
    results. If not provided a data vector, the accumulator function returns the
    current cluster results.

    Cluster results comprise the following:

    - centroids: a `k x ndims` matrix containing centroid locations. Each
      centroid is the component-wise mean of the data points assigned to a
      centroid's corresponding cluster.
    - stats: a `k x 4` matrix containing cluster statistics.
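
    To illustrate how a centroid can remain the component-wise mean of its
    assigned points without storing those points, here is a minimal sketch. The
    helper `updateCentroid` is hypothetical and uses plain arrays; it is not
    necessarily how the accumulator is implemented.

```javascript
// Hypothetical helper: folds a newly assigned point `x` into `centroid` so
// that the centroid stays the component-wise mean of all `n` assigned points
// (where `n` already counts `x`). Mutates and returns `centroid`.
function updateCentroid( centroid, n, x ) {
    var i;
    for ( i = 0; i < centroid.length; i++ ) {
        centroid[ i ] += ( x[ i ] - centroid[ i ] ) / n;
    }
    return centroid;
}

// A cluster which starts with the single point (2,1):
var c = [ 2.0, 1.0 ];

// Assign a second and then a third point to the same cluster:
updateCentroid( c, 2, [ 4.0, 3.0 ] );
updateCentroid( c, 3, [ 0.0, 5.0 ] );
// c is now [ 2.0, 3.0 ], the mean of (2,1), (4,3), and (0,5)
```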

    Cluster statistics consist of the following columns:

    - 0: number of data points assigned to a cluster.
    - 1: total within-cluster sum of squared distances.
    - 2: arithmetic mean of squared distances.
    - 3: corrected sample standard deviation of squared distances.
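
    The four columns can be maintained one data point at a time. The following
    is a minimal sketch for a single cluster, assuming Welford's online
    algorithm for the mean and corrected sample standard deviation. The helper
    `createClusterStats` is hypothetical, and returning `0.0` for a cluster
    with a single point is a choice made here, not documented behavior.

```javascript
// Hypothetical helper (not part of the library): tracks one row of the
// `k x 4` statistics matrix using Welford's online algorithm.
function createClusterStats() {
    var n = 0;      // column 0: number of data points
    var sum = 0.0;  // column 1: sum of squared distances
    var mean = 0.0; // column 2: arithmetic mean of squared distances
    var M2 = 0.0;   // running sum of squared deviations from the mean

    // Accepts the squared distance between a new point and its centroid:
    return function update( d2 ) {
        var delta;
        var sd;
        n += 1;
        sum += d2;
        delta = d2 - mean;
        mean += delta / n;
        M2 += delta * ( d2 - mean );

        // Column 3: corrected (n-1) sample standard deviation.
        // Assumption: report 0.0 until at least two points are available.
        sd = ( n > 1 ) ? Math.sqrt( M2 / ( n-1 ) ) : 0.0;
        return [ n, sum, mean, sd ];
    };
}

var update = createClusterStats();
update( 1.0 );
update( 4.0 );
var row = update( 7.0 );
// row is [ 3, 12, 4, 3 ]
```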

    Because an accumulator incrementally partitions data, one should *not*
    expect cluster statistics to match similar statistics had provided data been
    analyzed via a batch algorithm. In an incremental context, data points which
    would not be considered part of a particular cluster when analyzed via a
    batch algorithm may contribute to that cluster's statistics when analyzed
    incrementally.

    In general, the more data provided to an accumulator, the more reliable the
    cluster statistics.

    Parameters
    ----------
    k: integer|ndarray
        Number of clusters or a `k x ndims` matrix containing initial centroids.

    ndims: integer (optional)
        Number of dimensions. This argument must accompany an integer-valued
        first argument.

    options: Object (optional)
        Function options.

    options.metric: string (optional)
        Distance metric. Must be one of the following: 'euclidean', 'cosine', or
        'correlation'. Default: 'euclidean'.

    options.init: ArrayLike (optional)
        Centroid initialization method and associated (optional) parameters. The
        first array element specifies the initialization method and must be one
        of the following: 'kmeans++', 'sample', or 'forgy'. The second array
        element specifies the number of data points to use when calculating
        initial centroids. When performing kmeans++ initialization, the third
        array element specifies the number of trials to perform when randomly
        selecting candidate centroids. Typically, more trials are correlated
        with initial centroids which lead to better clustering; however, a
        greater number of trials increases computational overhead. Default:
        ['kmeans++', k, 2+⌊ln(k)⌋].
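
        As a worked example of the default number of trials, the expression
        2+⌊ln(k)⌋ can be evaluated for a few values of `k`. The helper
        `defaultTrials` is hypothetical and only restates the formula above.

```javascript
// Hypothetical helper: default number of kmeans++ trials, 2 + floor( ln( k ) ).
function defaultTrials( k ) {
    return 2 + Math.floor( Math.log( k ) );
}

var trials = [ 1, 5, 10, 100 ].map( defaultTrials );
// trials is [ 2, 3, 4, 6 ]

// The corresponding default `init` option for k = 5:
var init = [ 'kmeans++', 5, defaultTrials( 5 ) ];
// init is [ 'kmeans++', 5, 3 ]
```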

    options.normalize: boolean (optional)
        Boolean indicating whether to normalize incoming data. This option is
        only relevant for non-Euclidean distance metrics. If set to `true`, an
        accumulator partitioning data based on cosine distance normalizes
        provided data to unit Euclidean length. If set to `true`, an accumulator
        partitioning data based on correlation distance first centers provided
        data and then normalizes data dimensions to have zero mean and unit
        variance. If this option is set to `false` and the metric is either
        cosine or correlation distance, then incoming data *must* already be
        normalized. Default: true.
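
        To make the cosine case concrete, here is a minimal sketch of scaling a
        vector to unit Euclidean length and computing cosine distance between
        unit vectors. Both helpers are hypothetical and operate on plain
        arrays; they are not the library's implementation.

```javascript
// Hypothetical helper: returns a copy of `vec` scaled to unit Euclidean
// length. A zero vector would produce NaN components.
function toUnitLength( vec ) {
    var norm = 0.0;
    var out = [];
    var i;
    for ( i = 0; i < vec.length; i++ ) {
        norm += vec[ i ] * vec[ i ];
    }
    norm = Math.sqrt( norm );
    for ( i = 0; i < vec.length; i++ ) {
        out.push( vec[ i ] / norm );
    }
    return out;
}

// Hypothetical helper: cosine distance between two unit-length vectors,
// which reduces to one minus their dot product.
function cosineDistanceUnit( u, v ) {
    var dot = 0.0;
    var i;
    for ( i = 0; i < u.length; i++ ) {
        dot += u[ i ] * v[ i ];
    }
    return 1.0 - dot;
}

var x = [ 3.0, 4.0 ];
var u = toUnitLength( x );            // approximately [ 0.6, 0.8 ]
var w = toUnitLength( [ 4.0, 3.0 ] ); // approximately [ 0.8, 0.6 ]
var d = cosineDistanceUnit( u, w );   // approximately 0.04
```

        Returning a new array leaves the caller's vector unmodified, which is
        the situation the `copy` option is meant to guarantee.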

    options.copy: boolean (optional)
        Boolean indicating whether to copy incoming data to prevent mutation
        during normalization. Default: true.

    options.seed: any (optional)
        PRNG seed. Setting this option is useful when wanting reproducible
        initial centroids.

    Returns
    -------
    acc: Function
        Accumulator function.

    acc.predict: Function
        Predicts centroid assignment for each data point in a provided matrix.
        To specify an output vector, provide a 1-dimensional ndarray as the
        first argument. Each element in the returned vector corresponds to a
        predicted cluster index for a respective data point.
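
        Conceptually, prediction assigns each data point to its closest
        centroid. A minimal sketch, assuming squared Euclidean distance and
        plain nested arrays in place of ndarrays; `predictAssignments` is a
        hypothetical name and not the library's API.

```javascript
// Hypothetical helper: for each row of `X`, returns the index of the nearest
// row of `centroids` according to squared Euclidean distance.
function predictAssignments( centroids, X ) {
    var best;
    var out = [];
    var d2;
    var d;
    var i;
    var j;
    var m;
    for ( i = 0; i < X.length; i++ ) {
        best = Infinity;
        out.push( -1 );
        for ( j = 0; j < centroids.length; j++ ) {
            d2 = 0.0;
            for ( m = 0; m < X[ i ].length; m++ ) {
                d = X[ i ][ m ] - centroids[ j ][ m ];
                d2 += d * d;
            }
            if ( d2 < best ) {
                best = d2;
                out[ i ] = j;
            }
        }
    }
    return out;
}

var centroids = [ [ 0.0, 0.0 ], [ 10.0, 10.0 ] ];
var X = [ [ 1.0, -1.0 ], [ 9.0, 12.0 ], [ 4.0, 4.0 ] ];
var idx = predictAssignments( centroids, X );
// idx is [ 0, 1, 0 ]
```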

    Examples
    --------
    > var accumulator = {{alias}}( 5, 2 );
    > var buf = new {{alias:@stdlib/array/float64}}( 2 );
    > var shape = [ 2 ];
    > var strides = [ 1 ];
    > var v = {{alias:@stdlib/ndarray/ctor}}( 'float64', buf, shape, strides, 0, 'row-major' );
    > v.set( 0, 2.0 );
    > v.set( 1, 1.0 );
    > var out = accumulator( v );
    > v.set( 0, -5.0 );
    > v.set( 1, 3.14 );
    > out = accumulator( v );

    See Also
    --------