\subsection{Fitting Lines -- Regression}
\label{sec:linefitting}
{
\pgfkeys{
/pgfmanual/gray key prefixes={/pgfplots/table},
}
This section documents how \PGFPlots\ fits lines to input coordinates. \PGFPlots\ currently supports |create col/linear regression| applied to columns of input tables. The feature relies on \PGFPlotstable; it is implemented as a table postprocessing method.
\begin{stylekey}{/pgfplots/table/create col/linear regression=\marg{key-value-config}}%
\pgfkeys{
/pgfmanual/gray key prefixes={/pgfplots/table/create col/linear regression/},
/pdflinks/search key prefixes in/.add={/pgfplots/table/create col/linear regression/,}{},
}
A style for use in |\addplot table| which computes a linear (least squares) regression $y(x) = a \cdot x + b$ using the sample data $(x_i,y_i)$ which has to be specified inside of \meta{key-value-config} (see below).
It creates a new column on-the-fly which contains the values $y(x_i) = a \cdot x_i + b$. The values $a$ and $b$ will be stored (globally) into \declareandlabel{\pgfplotstableregressiona} and \declareandlabel{\pgfplotstableregressionb}.
\begin{codeexample}[]
\begin{tikzpicture}
\begin{axis}[legend pos=outer north east]
\addplot table {% plot X versus Y. This is original data.
X Y
1 1
2 4
3 9
4 16
5 25
6 36
};
\addplot table[
y={create col/linear regression={y=Y}}] % compute a linear regression from the input table
{
X Y
1 1
2 4
3 9
4 16
5 25
6 36
};
%\xdef\slope{\pgfplotstableregressiona} %<-- might be handy occasionally
\addlegendentry{$y(x)$}
\addlegendentry{%
$\pgfmathprintnumber{\pgfplotstableregressiona} \cdot x
\pgfmathprintnumber[print sign]{\pgfplotstableregressionb}$}
\end{axis}
\end{tikzpicture}
\end{codeexample}
The example above has two plots: one showing the data and one containing the |linear regression| line. We use |y={create col/linear regression={y=Y}}| here, which creates a new column\footnote{The \texttt{y=\{create col/} feature is available for any other \PGFPlotstable\ postprocessing style, see the \texttt{create on use} documentation in the \PGFPlotstable\ manual.} containing the regression values automatically.
As its argument, we provide the $y$ column name explicitly\footnote{In fact, \PGFPlots\ sees that there are only two columns and uses the second by default. But you need to provide it if there are at least 3 columns.}. The $x$ value is determined from context: |linear regression| is evaluated inside of |\addplot table|, so it uses the same $x$ as |\addplot table| (i.e.\ if you write |\addplot table[x=|\marg{col name}|]|, the regression will also use \meta{col name} as its |x| input). Furthermore, the example shows the line parameters $a$ and $b$ in the legend.
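As a plausibility check for the legend entry, the coefficients can be recomputed by hand. The formulas below are the textbook closed form of an unweighted least squares line; this section does not state which algorithm the implementation uses internally, so they are an assumption made for illustration only. With $n$ sample points,
\[
	a = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \bigl(\sum_i x_i\bigr)^2},
	\qquad
	b = \frac{\sum_i y_i - a \sum_i x_i}{n}.
\]
For the sample data of the example ($x_i = 1,\dotsc,6$ and $y_i = x_i^2$) we have $n=6$, $\sum_i x_i = 21$, $\sum_i y_i = 91$, $\sum_i x_i^2 = 91$ and $\sum_i x_i y_i = 441$, hence
\[
	a = \frac{6 \cdot 441 - 21 \cdot 91}{6 \cdot 91 - 21^2} = \frac{735}{105} = 7,
	\qquad
	b = \frac{91 - 7 \cdot 21}{6} = -\frac{28}{3} \approx -9.33.
\]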
The following \meta{key-value-config} keys are accepted as a comma--separated list:
\begin{key}{%
/pgfplots/table/create col/linear regression/table=\marg{\textbackslash macro {\normalfont or} file name} (initially empty)}
Provides the table from which the |x| and |y| columns are loaded. It defaults to the currently processed one, i.e.\ to the value of |\pgfplotstablename|.
\end{key}
\begin{keylist}{%
/pgfplots/table/create col/linear regression/x=\marg{column} (initially empty),
/pgfplots/table/create col/linear regression/y=\marg{column} (initially empty)}
Provides the source of $x_i$ and $y_i$ data, respectively. The argument \meta{column} is usually a column name of the input table, yet it can also contain |[index]|\meta{integer} to designate column indices (starting with $0$), |create on use| specifications or |alias|es (see the \PGFPlotstable\ manual for details on |create on use| and |alias|).
The initial configuration (an empty value) checks the context where the |linear regression| is evaluated. If it is evaluated inside of |\pgfplotstabletypeset|, it uses the first and second table columns. If it is evaluated inside of |\addplot table|, it uses the same $x$ input as the |\addplot table| statement. The |y| key needs to be provided explicitly (unless the table has only two columns).
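The following sketch illustrates the |[index]| syntax for a table with three columns, where |y| has to be given explicitly. The column names |T|, |A|, |B| and the numbers are made up for this illustration; the snippet has not been taken from an existing example file.
\begin{codeexample}[code only]
\begin{tikzpicture}
\begin{axis}
\addplot table[
    x=T,
    % three columns: `y' cannot be guessed and is given explicitly.
    % [index]2 selects the third column `B' (indices start with 0).
    y={create col/linear regression={y=[index]2}}]
{
T  A   B
1  10  2.1
2  20  3.9
3  30  6.2
4  40  7.8
};
\end{axis}
\end{tikzpicture}
\end{codeexample}
Writing |y=B| instead of |y=[index]2| should select the same column by name; the $x$ input is taken from |x=T| of the surrounding |\addplot table|.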
\end{keylist}
\begin{keylist}{%
/pgfplots/table/create col/linear regression/xmode=\mchoice{auto,linear,log} (initially auto),
/pgfplots/table/create col/linear regression/ymode=\mchoice{auto,linear,log} (initially auto)}
Enables or disables processing of logarithmic coordinates. Logarithmic processing applies $\ln$ before computing the regression line and $\exp$ afterwards.
The choice |auto| checks if the column is evaluated inside of a \PGFPlots\ axis. If so, it uses the axis scaling of the embedding axis. Otherwise, it uses |linear|.
In case of logarithmic coordinates, the |log basis x| and |log basis y| keys determine the basis.
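To make the effect of logarithmic processing concrete, here is the model that follows from the description above, assuming that both |xmode| and |ymode| resolve to |log| and that the natural logarithm is used as basis. The regression line is then fitted to the points $(\ln x_i, \ln y_i)$:
\[
	\ln y(x) = a \cdot \ln x + b
	\quad\Longleftrightarrow\quad
	y(x) = e^{b} \cdot x^{a}.
\]
Under these assumptions, $a$ is the exponent of a power law in the original coordinates, which appears as the slope of a straight line in a log--log plot.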
\begin{codeexample}[]
\begin{tikzpicture}
\begin{loglogaxis}
\addplot table[x=dof,y=error2]
{pgfplotstable.example1.dat};
\addlegendentry{$y(x)$}
\addplot table[
x=dof,
y={create col/linear regression={y=error2}}]
{pgfplotstable.example1.dat};
% might be handy occasionally:
%\xdef\slope{\pgfplotstableregressiona}
\addlegendentry{slope
$\pgfmathprintnumber{\pgfplotstableregressiona}$}
\end{loglogaxis}
\end{tikzpicture}
\end{codeexample}
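The (commented) line containing |\slope| also appears in the first example above; it copies the current value of |\pgfplotstableregressiona| into a macro, which allows you to remember different regression slopes when more than one regression is computed.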
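A minimal sketch of how the |\xdef| idiom might be used to keep two slopes apart is shown below. The macro names |\slopeA| and |\slopeB| are hypothetical, and the sketch assumes that the example table also provides a column named |error1|.
\begin{codeexample}[code only]
\begin{tikzpicture}
\begin{loglogaxis}
\addplot table[
    x=dof,
    y={create col/linear regression={y=error1}}]
    {pgfplotstable.example1.dat};
% copy the slope before the next regression replaces it:
\xdef\slopeA{\pgfplotstableregressiona}
\addlegendentry{slope $\pgfmathprintnumber{\slopeA}$}

\addplot table[
    x=dof,
    y={create col/linear regression={y=error2}}]
    {pgfplotstable.example1.dat};
\xdef\slopeB{\pgfplotstableregressiona}
\addlegendentry{slope $\pgfmathprintnumber{\slopeB}$}
\end{loglogaxis}
\end{tikzpicture}
\end{codeexample}
Since the regression parameters are stored globally, each copy is made directly after the |\addplot| command that produced it.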
\end{keylist}
\begin{keylist}{%
/pgfplots/table/create col/linear regression/variance list=\marg{list} (initially empty),%
/pgfplots/table/create col/linear regression/variance=\marg{column name} (initially empty)%
}
Both keys assign uncertainties (variances) to individual data points.
A high (relative) variance indicates an unreliable data point; a value of $1$ is standard.
The |variance list| key provides variances directly as a comma--separated list, for example
|variance list={1000,1000,500,200,1,1}|.
The |variance| key loads the values from the table column \meta{column name}. Such a column is (initially, see below) loaded from the same table in which the data points have been found. The \meta{column name} may also be a |create on use| name.
\begin{codeexample}[]
\begin{tikzpicture}
\begin{loglogaxis}
\addplot table[x=dof,y=error2]
{pgfplotstable.example1.dat};
\addlegendentry{$y(x)$}
\addplot table[
x=dof,
y={create col/linear regression={
y=error2,
variance list={1000,800,600,500,400}}
}
]
{pgfplotstable.example1.dat};
\addlegendentry{slope
$\pgfmathprintnumber{\pgfplotstableregressiona}$}
\end{loglogaxis}
\end{tikzpicture}
\end{codeexample}
If both |variance list| and |variance| are given, |variance list| takes precedence. Note that it is not necessary to provide variances for every data point.
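For comparison, the |variance| key might be used as in the following sketch, which reads the variances from a third column of the same inline table. The column name |Var| and all numbers are invented for this illustration; the first two points are marked as unreliable by their large variances.
\begin{codeexample}[code only]
\begin{tikzpicture}
\begin{axis}
\addplot table[
    x=X,
    y={create col/linear regression={
        y=Y,          % required: the table has three columns
        variance=Var  % per-point variances from column `Var'
    }}]
{
X  Y    Var
1  5.0  1000
2  1.0  1000
3  3.1  1
4  3.9  1
5  5.2  1
};
\end{axis}
\end{tikzpicture}
\end{codeexample}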
\end{keylist}
\begin{key}{/pgfplots/table/create col/linear regression/variance src=\marg{\textbackslash table {\normalfont or} file name} (initially empty)}
Loads the |variance| column from another table. The initial setting is empty. The |variance| column in the external table may have fewer entries than there are data points; in this case, only the first ones will be used.
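A sketch of a possible use is given below. The macro name |\variancetable|, its column |Var| and the values are hypothetical; the external table deliberately has only three entries.
\begin{codeexample}[code only]
% a separate table which contains nothing but variances:
\pgfplotstableread{
Var
1000
800
600
}\variancetable

\begin{tikzpicture}
\begin{loglogaxis}
\addplot table[
    x=dof,
    y={create col/linear regression={
        y=error2,
        variance=Var,
        variance src=\variancetable}}]
    {pgfplotstable.example1.dat};
\end{loglogaxis}
\end{tikzpicture}
\end{codeexample}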
\end{key}
\end{stylekey}
\paragraph{Limitations:} Currently, \PGFPlots\ supports only linear regression, and regression is only supported together with |\addplot table|. Furthermore, processing long input tables may take considerable time.
}