File: decision-vegan.Rnw

package info (click to toggle)
r-cran-vegan 2.5-7%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm, bullseye, sid
  • size: 5,564 kB
  • sloc: ansic: 2,275; fortran: 1,088; sh: 42; makefile: 2
file content (722 lines) | stat: -rw-r--r-- 30,859 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
%\VignetteIndexEntry{Design decisions and implementation}

\documentclass[a4paper,10pt,twocolumn]{article}
\usepackage{vegan} % package options and redefinitions

\author{Jari Oksanen}
\title{Design decisions and implementation details in vegan}

\date{\footnotesize{
  processed with vegan
\Sexpr{packageDescription("vegan", field="Version")}
in \Sexpr{R.version.string} on \today}}

%% need no \usepackage{Sweave}
\begin{document}
\bibliographystyle{jss}

\SweaveOpts{strip.white=true}

<<echo=false,results=hide>>=
figset <- function() par(mar=c(4,4,1,1)+.1)
options(SweaveHooks = list(fig = figset))
options("prompt" = "> ", "continue" = "  ")
options(width = 55)
require(vegan)
@
\maketitle

\begin{abstract}
  This document describes design decisions, and discusses implementation
and algorithmic details in some vegan functions. The proper FAQ is
another document.
\end{abstract}

\tableofcontents

\section{Parallel processing}

Several \pkg{vegan} functions can perform parallel processing using
the standard \R{} package \pkg{parallel}.
The \pkg{parallel} package in \R{} implements
the functionality of earlier contributed packages \pkg{multicore} and
\pkg{snow}.  The \pkg{multicore} functionality forks the analysis to
multiple cores, and \pkg{snow} functionality sets up a socket cluster
of workers.  The \pkg{multicore} functionality only works in unix-like
systems (such as MacOS and Linux), but \pkg{snow} functionality works
in all operating systems.  \pkg{Vegan} can use either method, but
defaults to \pkg{multicore} functionality when this is available,
because its forked clusters are usually faster.  This chapter
describes both the user interface and internal implementation for the
developers.

\subsection{User interface}
\label{sec:parallel:ui}

The functions that are capable of parallel processing have argument
\code{parallel}.  The normal default is \code{parallel = 1} which
means that no parallel processing is performed.  It is possible to set
parallel processing as the default in \pkg{vegan} (see
\S\,\ref{sec:parallel:default}).

For parallel processing, the \code{parallel} argument can be either

\begin{enumerate}
\item An integer in which case the given number of parallel processes
  will be launched (value $1$ launches non-parallel processing). In
  unix-like systems (\emph{e.g.}, MacOS, Linux) these will be forked
  \code{multicore} processes. In Windows socket clusters will be set up,
  initialized and closed.
\item A previously created socket cluster. This saves time as the
  cluster is not set up and closed in the function.  If the argument is a
  socket cluster, it will also be used in unix-like systems. Setting
  up a socket cluster is discussed in \S\,\ref{sec:parallel:socket}.
\end{enumerate}

\subsubsection{Using parallel processing as default}
\label{sec:parallel:default}

If the user sets option \code{mc.cores}, its value will be used as the
default value of the \code{parallel} argument in \pkg{vegan}
functions.  The following command will set up parallel processing to
all subsequent \pkg{vegan} commands:
<<eval=false>>=
options(mc.cores = 2)
@

The \code{mc.cores} option is defined in the \pkg{parallel} package,
but it is usually unset in which case \pkg{vegan} will default to
non-parallel computation.  The \code{mc.cores} option can be set by
the environmental variable \code{MC_CORES} when the \pkg{parallel}
package is loaded.

\R{} allows setting up a default socket cluster
(\code{setDefaultCluster}), but this will not be used in \pkg{vegan}.

\subsubsection{Setting up socket clusters}
\label{sec:parallel:socket}

If socket clusters are used (and they are the only alternative in
Windows), it is often wise to set up a cluster before calling
parallelized code and give the pre-defined cluster as the value of
the \code{parallel} argument in \pkg{vegan}.  If you want to use
socket clusters in unix-like systems (MacOS, Linux), this can be only
done with pre-defined clusters.

If socket cluster is not set up in Windows, \pkg{vegan} will create and
close the cluster within the function body. This involves following commands:
\begin{Schunk}
\begin{Soutput}
clus <- makeCluster(4)
## perform parallel processing
stopCluster(clus)
\end{Soutput}
\end{Schunk}
The first command sets up the cluster, in this case with four
cores, and the second command stops the cluster.

Most parallelized \pkg{vegan} functions work similarly in socket and
fork clusters, but in \code{oecosimu} the parallel processing is used
to evaluate user-defined functions, and their arguments and data must
be made known to the socket cluster.  For example, if you want to run
in parallel the \code{meandist} function of the \code{oecosimu}
example with a pre-defined socket cluster, you must use:
<<eval=false>>=
## start up and define meandist()
library(vegan)
data(sipoo)
meandist <-
    function(x) mean(vegdist(x, "bray"))
library(parallel)
clus <- makeCluster(4)
clusterEvalQ(clus, library(vegan))
mbc1 <- oecosimu(dune, meandist, "r2dtable",
                 parallel = clus)
stopCluster(clus)
@
Socket clusters are used for parallel processing in Windows, but you
do not need to pre-define the socket cluster in \code{oecosimu} if you
only need \pkg{vegan} commands.  However, if you need some other
contributed packages, you must pre-define the socket cluster also in
Windows with appropriate \code{clusterEvalQ} calls.

If you pre-set the cluster, you can also use \pkg{snow} style socket
clusters in unix-like systems.

\subsubsection{Random number generation}

\pkg{Vegan} does not use parallel processing in random number
generation, and you can set the seed for the standard random number
generator. Setting the seed for the parallelized generator (L'Ecuyer)
has no effect in \pkg{vegan}.

\subsubsection{Does it pay off?}

Parallelized processing has a considerable overhead, and the analysis
is faster only if the non-parallel code is really slow (takes several
seconds in wall clock time). The overhead is particularly large in
socket clusters (in Windows). Creating a socket cluster and evaluating
\code{library(vegan)} with \code{clusterEvalQ} can take two seconds or
longer, and only pays off if the non-parallel analysis takes ten
seconds or longer. Using pre-defined clusters will reduce the
overhead. Fork clusters (in unix-likes operating systems) have a
smaller overhead and can be faster, but they also have an overhead.

Each parallel process needs memory, and for a large number of
processes you need much memory.  If the memory is exhausted, the
parallel processes can stall and  take much longer than
non-parallel processes (minutes instead of seconds).

If the analysis is fast, and function runs in, say, less than five
seconds, parallel processing is rarely useful.  Parallel processing is
useful only in slow analyses: large number of replications or
simulations, slow evaluation of each simulation. The danger of memory
exhaustion must always be remembered.

The benefits and potential problems of parallel processing depend on
your particular system: it is best to rely on your own experience.

\subsection{Internals for developers}

The implementation of the parallel processing should accord with the
description of the user interface above (\S\,\ref{sec:parallel:ui}).
Function \code{oecosimu} can be used as a reference implementation,
and similar interpretation and order of interpretation of arguments
should be followed.  All future implementations should be consistent
and all must be changed if the call heuristic changes.

The value of the \code{parallel} argument can be \code{NULL}, a
positive integer or a socket cluster.  Integer $1$ means that no
parallel processing is performed.  The ``normal'' default is
\code{NULL} which in  the ``normal'' case is interpreted as $1$.  Here
``normal'' means that \R{} is run with default settings without
setting \code{mc.cores} or environmental variable \code{MC_CORES}.

Function \code{oecosimu} interprets the \code{parallel} arguments in
the following way:
\begin{enumerate}
\item \code{NULL}: The function is called with argument \code{parallel
    = getOption("mc.cores")}. The option \code{mc.cores} is normally
  unset and then the default is \code{parallel = NULL}.
\item Integer: An integer value is taken as the number of created
  parallel processes.  In unix-like systems this is the number of
  forked multicore processes, and in Windows this is the number of
  workers in socket clusters.  In Windows, the socket cluster is
  created, and if needed \code{library(vegan)} is evaluated in the
  cluster (this is not necessary if the function only uses internal
  functions), and the cluster is stopped after parallel processing.
\item Socket cluster: If a socket cluster is given, it will be used in
  all operating systems, and  the cluster is not stopped
  within the function.
\end{enumerate}

This gives the following precedence order for parallel processing
(highest to lowest):
\begin{enumerate}
  \item Explicitly given argument value of \code{parallel} will always
    be used.
  \item If \code{mc.cores} is set, it will be used. In Windows this
    means creating and stopping socket clusters. Please note
    that the \code{mc.cores} is only set from the environmental
    variable \code{MC_CORES} when you load the \pkg{parallel} package,
    and it is always unset before first
    \code{require(parallel)}.
 \item The fall back behaviour is no parallel processing.
\end{enumerate}

\section{Nestedness and Null models}

Some published indices of nestedness and null models of communities
are only described in general terms, and they could be implemented in
various ways. Here I discuss the implementation in \pkg{vegan}.

\subsection{Matrix temperature}

The matrix temperature is intuitively simple
(Fig. \ref{fig:nestedtemp}), but the the exact calculations were not
explained in the original publication \cite{AtmarPat93}.
\begin{figure}
<<fig=true,echo=false,results=hide>>=
data(sipoo)
mod <- nestedtemp(sipoo)
plot(mod, "i")
x <- mod$c["Falcsubb"]
y <- 1 - mod$r["Svartholm"]
points(x,y, pch=16, cex=1.5)
abline(x+y, -1, lty=2)
f <- function(x, p) (1-(1-x)^p)^(1/p)
cross <- function(x, a, p) f(x,p) - a + x
r <- uniroot(cross, c(0,1), a = x+y, p = mod$p)$root
arrows(x,y, r, f(r, mod$p), lwd=4)
@
\caption{Matrix temperature for \emph{Falco subbuteo} on Sibbo
  Svartholmen (dot). The curve is the fill line, and in a cold
  matrix, all presences (red squares) should be in the upper left
  corner behind the fill line. Dashed diagonal line of length $D$ goes
  through the point, and an arrow of length $d$ connects the point to
  the fill line. The ``surprise'' for this point is $u = (d/D)^2$ and
  the matrix temperature is based on the sum of surprises: presences
  outside the fill line or absences within the fill line.}
\label{fig:nestedtemp}
\end{figure}
The function can be implemented in many ways following the general
principles.  \citet{RodGir06} have seen the original code and reveal
more details of calculations, and their explanation is the basis of
the implementation in \pkg{vegan}.  However, there are still some open
issues, and probably \pkg{vegan} function \code{nestedtemp} will never
exactly reproduce results from other programs, although it is based on
the same general principles.\footnote{function \code{nestedness} in
  the \pkg{bipartite} package is a direct port of the original
  \proglang{BINMATNEST} program of \citet{RodGir06}.}  I try to give
main computation details in this document --- all details can be seen
in the source code of \code{nestedtemp}.

\begin{itemize}
\item Species and sites are put into unit square \citep{RodGir06}. The
  row and column coordinates will be $(k-0.5)/n$ for $k=1 \ldots n$,
  so that there are no points in the corners or the margins of the
  unit square, and a diagonal line can be drawn through any point. I
  do not know how the rows and columns are converted to the unit
  square in other software, and this may be a considerable source of
  differences among implementations.
  \item Species and sites are ordered alternately using indices
    \citep{RodGir06}:
    \begin{equation}
    \begin{split}
      s_j &= \sum_{i|x_{ij} = 1} i^2 \\
      t_j &= \sum_{i|x_{ij} = 0} (n-i+1)^2
    \end{split}
    \end{equation}
    Here $x$ is the data matrix, where $1$ is presence, and $0$ is
    absence, $i$ and $j$ are row and column indices, and $n$ is the
    number of rows. The equations give the indices for columns, but
    the indices can be reversed for corresponding row indexing.
    Ordering by $s$ packs presences to the top left corner, and
    ordering by $t$ pack zeros away from the top left corner. The final
    sorting should be ``a compromise'' \citep{RodGir06} between these
    scores, and \pkg{vegan} uses $s+t$.  The result should be cool,
    but the packing does not try to minimize the temperature
    \citep{RodGir06}.  I do not know how the ``compromise'' is
    defined, and this can cause some differences to other
    implementations.
  \item The following function is used to define the fill line:
    \begin{equation}
      y = (1-(1-x)^p)^{1/p}
    \end{equation}
    This is similar to the equation suggested by
    \citet[eq. 4]{RodGir06}, but omits all terms dependent on the
    numbers of species or sites, because I could not understand why
    they were needed. The differences are visible only in small data
    sets. The $y$ and $x$ are the coordinates in the unit square, and
    the parameter $p$ is selected so that the curve covers the same
    area as is the proportion of presences
    (Fig. \ref{fig:nestedtemp}). The parameter $p$ is found
    numerically using \proglang{R} functions \code{integrate} and
    \code{uniroot}.  The fill line used in the original matrix
    temperature software \citep{AtmarPat93} is supposed to be similar
    \citep{RodGir06}. Small details in the fill line combined with
    differences in scores used in the unit square (especially in the
    corners) can cause large differences in the results.
  \item A line with slope\,$= -1$ is drawn through the point and the $x$
    coordinate of the intersection of this line and the fill line is
    found using function \code{uniroot}. The difference of this
    intersection and the row coordinate gives the argument $d$ of matrix
    temperature (Fig. \ref{fig:nestedtemp}).
  \item In other software, ``duplicated'' species occurring on every
    site are removed, as well as empty sites and species after
    reordering \cite{RodGir06}. This is not done in \pkg{vegan}.
\end{itemize}


\section{Scaling in redundancy analysis}

This chapter discusses the scaling of scores (results) in redundancy
analysis and principal component analysis performed by function
\code{rda} in the \pkg{vegan} library.

Principal component analysis decomposes a centred data matrix
$\mathbf{X} = \{x_{ij}\}$ into $K$ orthogonal components so that
$x_{ij} = \sqrt{n-1} \sum_{k=1}^K u_{ik} \sqrt{\lambda_k} v_{jk}$,
where $u_{ik}$ and $v_{jk}$ are orthonormal coefficient matrices and
$\lambda_k$ are eigenvalues. In \pkg{vegan} the eigenvalues sum up to
variance of the data, and therefore we need to multiply with the
square root of degrees of freedom $n-1$.  Orthonormality means that
sums of squared columns is one and their cross-product is zero, or
$\sum_i u_{ik}^2 = \sum_j v_{jk}^2 = 1$, and
$\sum_i u_{ik} u_{il} = \sum_j v_{jk} v_{jl} = 0$ for $k \neq l$. This
is a decomposition, and the original matrix is found exactly from the
singular vectors and corresponding singular values, and first two
singular components give the rank $=2$ least squares estimate of the
original matrix.

The coefficients $u_{ik}$ and $v_{jk}$ are scaled to unit length for all
axes $k$. Eigenvalues $\lambda_k$ give
the information of the importance of axes, or the `axis lengths.'
Instead of the orthonormal coefficients, or equal length axes, it is
customary to scale species (column) or site (row) scores or both by
eigenvalues to display the importance of axes and to describe the true
configuration of points.  Table \ref{tab:scales} shows some
alternative scalings.  These alternatives apply to principal
components analysis in all cases, and in redundancy analysis, they
apply to species scores and constraints or linear combination scores;
weighted averaging scores have somewhat wider dispersion.

\begin{table*}[t]
  \centering
  \caption{\label{tab:scales} Alternative scalings for \textsc{rda} used
    in the functions \code{prcomp} and \code{princomp}, and the
    one used in the \pkg{vegan} function \code{rda}
    and the proprietary software \proglang{Canoco}
    scores in terms of orthonormal species ($v_{ik}$) and site scores
    ($u_{jk}$), eigenvalues ($\lambda_k$), number of sites  ($n$) and
    species standard deviations ($s_j$). In \code{rda},
    $\mathrm{const} = \sqrt[4]{(n-1) \sum \lambda_k}$.  Corresponding
    negative scaling in \pkg{vegan}
   % and corresponding positive scaling in \texttt{Canoco 3}
    is derived
    dividing each  species by its standard deviation $s_j$ (possibly
    with some additional constant multiplier).  }
 \begin{tabular}{lcc}
  \\
  \toprule
& \textbf{Site scores} $u_{ik}^*$ &
\textbf{Species scores} $v_{jk}^*$ \\
\midrule
\code{prcomp, princomp} &
$u_{ik} \sqrt{n-1} \sqrt{\lambda_k}$ &
$v_{jk}$ \\
\code{stats::biplot} &
$u_{ik}$ &
$v_{jk} \sqrt{n} \sqrt{\lambda_k}$ \\
\code{stats::biplot, pc.biplot=TRUE} &
$u_{ik} \sqrt{n-1}$ &
$v_{jk} \sqrt{\lambda_k}$\\
\code{rda, scaling="sites"} &
$u_{ik} \sqrt{\lambda_k/ \sum \lambda_k} \times \mathrm{const}$ &
$v_{jk} \times \mathrm{const}$
\\
\code{rda, scaling="species"} &
$u_{ik} \times \mathrm{const}$ &
$v_{jk} \sqrt{\lambda_k/ \sum \lambda_k} \times \mathrm{const}$  \\
\code{rda, scaling="symmetric"} &
$u_{ik} \sqrt[4]{\lambda_k/ \sum \lambda_k} \times \mathrm{const}$ &
$v_{jk} \sqrt[4]{\lambda_k/ \sum \lambda_k} \times \mathrm{const}$ \\
\code{rda, correlation=TRUE} &
$u_{ik}^*$ &
$\sqrt{\sum \lambda_k /(n-1)} s_j^{-1} v_{jk}^*$
\\
% \code{Canoco 3, scaling=-1} &
% $u_{ik} \sqrt{n-1} \sqrt{\lambda_k / \sum \lambda_k}$ &
% $v_{jk} \sqrt{n}$ \\
% \code{Canoco 3, scaling=-2} &
% $u_{ik} \sqrt{n-1}$ &
% $v_{jk} \sqrt{n} \sqrt{\lambda_k / \sum \lambda_k}$
% \\
% \code{Canoco 3, scaling=-3} &
% $u_{ik} \sqrt{n-1} \sqrt[4]{\lambda_k / \sum \lambda_k}$ &
% $v_{jk} \sqrt{n} \sqrt[4]{\lambda_k / \sum \lambda_k}$
\bottomrule
\end{tabular}
\end{table*}

In community ecology, it is common to plot both species and sites in
the same graph.  If this graph is a graphical display of \textsc{pca},
or a graphical, low-dimensional approximation of the data, the graph
is called a biplot.  The graph is a biplot if the transformed scores
satisfy $x_{ij} = c \sum_k u_{ij}^* v_{jk}^*$ where $c$ is a scaling
constant.  In functions \code{princomp}, \code{prcomp} and \code{rda}
with \code{scaling = "sites"}, the plotted scores define a biplot so that
the eigenvalues are expressed for sites, and species are left
unscaled.
% For \texttt{Canoco 3} $c = n^{-1} \sqrt{n-1}
% \sqrt{\sum \lambda_k}$ with negative \proglang{Canoco} scaling
% values. All these $c$ are constants for a matrix, so these are all
% biplots with different internal scaling of species and site scores
% with respect to each other.  For \proglang{Canoco} with positive scaling
% values and \pkg{vegan} with negative scaling values, no constant
% $c$ can be found, but the correction is dependent on species standard
% deviations $s_j$, and these scores do not define a biplot.

There is no natural way of scaling species and site scores to each
other.  The eigenvalues in redundancy and principal components
analysis are scale-dependent and change when the data are multiplied
by a constant.  If we have percent cover data, the eigenvalues are
typically very high, and the scores scaled by eigenvalues will have
much wider dispersion than the orthonormal set.  If we express the
percentages as proportions, and divide the matrix by $100$, the
eigenvalues will be reduced by factor $100^2$, and the scores scaled
by eigenvalues will have a narrower dispersion.  For graphical biplots
we should be able to fix the relations of row and column scores to be
invariant against scaling of data.  The solution in \proglang{R}
standard function \code{biplot} is to scale site and species scores
independently, and typically very differently
(Table~\ref{tab:scales}), but plot each independently to fill the
graph area.  The solution in \proglang{Canoco} and \code{rda} is to
use proportional eigenvalues $\lambda_k / \sum \lambda_k$ instead of
original eigenvalues.  These proportions are invariant with scale
changes, and typically they have a nice range for plotting two data
sets in the same graph.

The \textbf{vegan} package uses a scaling constant $c = \sqrt[4]{(n-1)
  \sum \lambda_k}$ in order to be able to use scaling by proportional
eigenvalues (like in \proglang{Canoco}) and still be able to have a
biplot scaling. Because of this, the scaling of \code{rda} scores is
non-standard. However, the \code{scores} function lets you to set
the scaling constant to any desired values. It is also possible to
have two separate scaling constants: the first for the species, and
the second for sites and friends, and this allows getting scores of
other software or \proglang{R} functions (Table \ref{tab:rdaconst}).

\begin{table*}[t]
  \centering
  \caption{\label{tab:rdaconst} Values of the \code{const} argument in
    \textbf{vegan} to get the scores that are equal to those from
    other functions and software. Number of sites (rows) is $n$,
    the number of species (columns) is $m$, and the sum of all
    eigenvalues is $\sum_k \lambda_k$ (this is saved as the item
    \code{tot.chi} in the \code{rda} result)}.
 \begin{tabular}{lccc}
  \\
  \toprule
& \textbf{Scaling} &\textbf{Species constant} & \textbf{Site constant} \\
\midrule
\pkg{vegan} & any  & $\sqrt[4]{(n-1) \sum \lambda_k}$ & $\sqrt[4]{(n-1) \sum \lambda_k}$\\
\code{prcomp}, \code{princomp} & \code{1} & $1$ & $\sqrt{(n-1) \sum_k \lambda_k}$\\
\proglang{Canoco\,v3} & \code{-1, -2, -3} & $\sqrt{n-1}$ & $\sqrt{n}$\\
\proglang{Canoco\,v4} & \code{-1, -2, -3} & $\sqrt{m}$ & $\sqrt{n}$\\
\bottomrule
\end{tabular}
\end{table*}

The scaling is controlled by three arguments in the \code{scores}
function in \pkg{vegan}:
\begin{enumerate}
  \item \code{scaling} with options \code{"sites"}, \code{"species"}
    and \code{"symmetric"} defines the set of scores which is scaled
    by eigenvalues (Table~\ref{tab:scales}).
  \item \code{const} can be used to set the numeric scaling constant
    to non-default values (Table~\ref{tab:rdaconst}).
  \item \code{correlation} can be used to modify species scores so
    that they show the relative change of species abundance, or their
    correlation with the ordination (Table~\ref{tab:scales}). This is
    no longer a biplot scaling.
\end{enumerate}

\section{Weighted average and linear combination scores}

Constrained ordination methods such as Constrained Correspondence
Analysis (CCA) and Redundancy Analysis (RDA) produce two kind of site
scores \cite{Braak86, Palmer93}:
\begin{itemize}
\item
LC or Linear Combination Scores which are linear combinations of
constraining variables.
\item
WA or Weighted Averages Scores which are such weighted averages of
species scores that are as similar to LC scores as possible.
\end{itemize}
Many computer programs for constrained ordinations give only or
primarily LC scores following recommendation of
\citet{Palmer93}.  However, functions \code{cca} and \code{rda} in
the \pkg{vegan} package use primarily WA scores. This chapter
explains the reasons for this choice.

Briefly, the main reasons are that
\begin{itemize}
\item LC scores \emph{are} linear combinations, so they give us only
  the (scaled) environmental variables. This means that they are
  independent of vegetation and cannot be found from the species
  composition.  Moreover, identical combinations of environmental
  variables give identical LC scores irrespective of vegetation.
\item \citet{McCune97} has demonstrated that noisy environmental
  variables result in deteriorated LC scores whereas WA scores
  tolerate some errors in environmental variables.  All environmental
  measurements contain some errors, and therefore it is safer to use
  WA scores.
\end{itemize}
This article studies mainly the first point.  The users of
\pkg{vegan} have a choice of either LC or WA (default) scores, but
after reading this article, I believe that most of them do not want to
use LC scores, because they are not what they were looking for in
ordination.

\subsection{LC Scores are Linear Combinations}

Let us perform a simple CCA analysis using only two environmental
variables so that we can see the constrained solution completely in
two dimensions:
<<>>=
library(vegan)
data(varespec)
data(varechem)
orig <- cca(varespec ~ Al + K, varechem)
@
Function \code{cca} in \pkg{vegan} uses WA scores as
default. So we must specifically ask for LC scores
(Fig. \ref{fig:ccalc}).
<<a,fig=false>>=
plot(orig, dis=c("lc","bp"))
@
\begin{figure}
<<fig=true,echo=false>>=
<<a>>
@
\caption{LC scores in CCA of the original data.}
\label{fig:ccalc}
\end{figure}

What would happen to linear combinations of LC scores if we shuffle
the ordering of sites in species data?  Function \code{sample()} below
shuffles the indices.
<<>>=
i <- sample(nrow(varespec))
shuff <- cca(varespec[i,] ~ Al + K, varechem)
@
\begin{figure}
<<fig=true,echo=false>>=
plot(shuff, dis=c("lc","bp"))
@
\caption{LC scores of shuffled species data.}
\label{fig:ccashuff}
\end{figure}
It seems that site scores are fairly similar, but oriented differently
(Fig. \ref{fig:ccashuff}).  We can use Procrustes rotation to see how
similar the site scores indeed are (Fig. \ref{fig:ccaproc}).
<<a,fig=false>>=
plot(procrustes(scores(orig, dis="lc"),
                scores(shuff, dis="lc")))
@
\begin{figure}
<<fig=true,echo=false>>=
<<a>>
@
\caption{Procrustes rotation of LC scores from CCA of original and shuffled data.}
\label{fig:ccaproc}
\end{figure}
There is a small difference, but this will disappear if we use
Redundancy Analysis (RDA) instead of CCA
(Fig. \ref{fig:rdaproc}). Here we use a new shuffling as well.
<<>>=
tmp1 <- rda(varespec ~ Al + K, varechem)
i <- sample(nrow(varespec)) # Different shuffling
tmp2 <- rda(varespec[i,] ~ Al + K, varechem)
@
\begin{figure}
<<fig=true,echo=false>>=
plot(procrustes(scores(tmp1, dis="lc"),
                scores(tmp2, dis="lc")))
@
\caption{Procrustes rotation of LC scores in RDA of the original and shuffled data.}
\label{fig:rdaproc}
\end{figure}

LC scores indeed are linear combinations of constraints (environmental
variables) and \emph{independent of species data}: You can
shuffle your species data, or change the data completely, but the LC
scores will be unchanged in RDA.  In CCA the LC scores are
\emph{weighted} linear combinations with site totals of species data
as weights. Shuffling species data in CCA changes the weights, and
this can cause changes in LC scores.  The magnitude of changes depends
on the variability of site totals.

The original data and shuffled data differ in their goodness of
fit:
<<>>=
orig
shuff
@
Similarly their WA scores will be (probably) very different
(Fig. \ref{fig:ccawa}).
\begin{figure}
<<fig=true,echo=false>>=
plot(procrustes(orig, shuff))
@
\caption{Procrustes rotation of WA scores of CCA with the original and
  shuffled data.}
\label{fig:ccawa}
\end{figure}

The example used only two environmental variables so that we can
easily plot all constrained axes.  With a larger number of
environmental variables the full configuration remains similarly
unchanged, but its orientation may change, so that two-dimensional
projections look different.  In the full space, the differences should
remain within numerical accuracy:
<<>>=
tmp1 <- rda(varespec ~ ., varechem)
tmp2 <- rda(varespec[i,] ~ ., varechem)
proc <- procrustes(scores(tmp1, dis="lc", choi=1:14),
                   scores(tmp2, dis="lc", choi=1:14))
max(residuals(proc))
@
In \code{cca} the difference would be somewhat larger than now
observed \Sexpr{format.pval(max(residuals(proc)))} because site
weights used for environmental variables are shuffled with the species
data.

\subsection{Factor constraints}

It seems that users often get confused when they perform constrained
analysis using  only one factor (class variable) as constraint.  The
following example uses the classical dune meadow data \cite{Jongman87}:
<<>>=
data(dune)
data(dune.env)
orig <- cca(dune ~ Moisture, dune.env)
@
When the results are plotted using LC scores, sample plots fall only
in four alternative positions (Fig. \ref{fig:factorlc}).
\begin{figure}
<<fig=TRUE,echo=false>>=
plot(orig, dis="lc")
@
\caption{LC scores of the dune meadow data using only one factor as a
  constraint.}
\label{fig:factorlc}
\end{figure}
In the previous chapter we saw that this happens because LC scores
\emph{are} the environmental variables, and they can be distinct only
if the environmental variables are distinct.  However, normally the user
would like to see how well the environmental variables separate the
vegetation, or inversely, how we could use the vegetation to
discriminate the environmental conditions.  For this purpose we should
plot WA scores, or LC scores and WA scores together:  The LC scores
show where the site \emph{should} be, the WA scores shows where the
site \emph{is}.

Function \code{ordispider} adds line segments to connect each WA
score with the corresponding LC (Fig.  \ref{fig:walcspider}).
<<a,fig=false>>=
plot(orig, display="wa", type="points")
ordispider(orig, col="red")
text(orig, dis="cn", col="blue")
@
\begin{figure}
<<fig=TRUE,echo=false>>=
<<a>>
@
\caption{A ``spider plot'' connecting WA scores to corresponding LC
  scores. The shorter the web segments, the better the ordination.}
\label{fig:walcspider}
\end{figure}
This is the standard way of displaying results of discriminant
analysis, too.  Moisture classes \code{1} and \code{2} seem to be
overlapping, and cannot be completely separated by their
vegetation. Other classes are more distinct, but there seems to be a
clear arc effect or a ``horseshoe'' despite using CCA.

\subsection{Conclusion}

LC scores are only the (weighted and scaled) constraints and
independent of vegetation. If you plot them, you plot only your
environmental variables. WA scores are based on vegetation data but
are constrained to be as similar to the LC scores as only
possible. Therefore \pkg{vegan} calls LC scores as
\code{constraints} and WA scores as \code{site scores}, and uses
primarily WA scores in plotting.  However, the user makes the ultimate
choice, since both scores are available.

\bibliography{vegan}

\end{document}