PLS, which was originally developed for IR and UV spectroscopy, owes its wide application to its speed, robustness, and user-friendliness. PLS performs a linear regression in a new coordinate system of lower dimensionality than the original space of the independent variables. The new coordinates are called PLS factors or principal components (the latter term is less correct but widely used, in analogy to the principal components of PCA). The principal components are determined by the maximum variance of the independent variables and by a maximum correlation with the dependent variable(s). In principle, there are as many principal components as independent variables, but only the first, most important principal components are used for the actual model. This makes PLS robust to noise, as in theory the noise should be confined to the less important principal components, while the information of interest should be captured by the first principal components. The actual regression is performed in the space spanned by the new, reduced coordinate system of the orthogonal principal components.

Different criteria exist for the number of principal components to be used: one of the simplest methods compares the eigenvalue of a principal component with the eigenvalues of the higher components using an F-test [30],[107].
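As a minimal illustration of this mechanism, the following sketch fits a PLS model to synthetic "spectra"; scikit-learn's PLSRegression and the data are assumptions made for illustration, not the software or data of this study:

```python
# Minimal PLS sketch on hypothetical synthetic "spectra" (assumed data,
# assumed package): the regression happens in a low-dimensional factor space.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 50, 200

# Two latent phenomena generate the spectra; the rest is noise.
latent = rng.normal(size=(n_samples, 2))
X = latent @ rng.normal(size=(2, n_wavelengths)) \
    + 0.1 * rng.normal(size=(n_samples, n_wavelengths))
y = latent @ np.array([1.0, -0.5]) + 0.05 * rng.normal(size=n_samples)

# Keep only the first two factors: the regression is carried out in this
# reduced, orthogonal coordinate system rather than on all 200 variables.
pls = PLSRegression(n_components=2).fit(X, y)
print(pls.x_scores_.shape)    # (50, 2)  -> score matrix T
print(pls.x_loadings_.shape)  # (200, 2) -> loading matrix P
print(pls.score(X, y))        # R^2 of the calibration fit
```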
Eastman et al. [31] used a crossvalidation method for determining the optimal number of principal components, which corresponds to the minimum predicted residual sum of squares (PRESS). This criterion is more conservative in terms of the number of principal components and is widespread in the literature. Recently, Martens et al. introduced the Martens' Uncertainty Test [32],[33], which uses a jackknifing procedure with many sub-models to determine the significant variables and the optimal number of principal components; both are found in an iterative procedure in which unstable variables are eliminated. Compared with the crossvalidation criterion, the number of principal components is biased towards a lower number, rendering this criterion very conservative for the selection of the principal components (see also section 6.1). Other methods use knowledge of the size of the measurement error to estimate the optimal number of principal components [34], or add artificial noise to the data and determine the optimal number of components by comparison with the original data in a bootstrapping procedure [35]. In this study, the Martens' Uncertainty Test and the minimum crossvalidation error criterion are used.
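The minimum crossvalidation error criterion can be sketched as follows, assuming scikit-learn and leave-one-out crossvalidation purely for illustration (the function name is hypothetical):

```python
# Sketch of the minimum crossvalidation error (PRESS) criterion: compute
# PRESS for an increasing number of components and keep the minimizer.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def n_components_by_press(X, y, max_components=10):
    """Return the number of PLS components with minimum PRESS."""
    press = []
    for a in range(1, max_components + 1):
        y_cv = cross_val_predict(PLSRegression(n_components=a), X, y,
                                 cv=LeaveOneOut())
        press.append(np.sum((np.asarray(y).ravel() - y_cv.ravel()) ** 2))
    return int(np.argmin(press)) + 1  # components are counted from 1
```

The Martens' Uncertainty Test goes further by eliminating unstable variables between such crossvalidation rounds; that iterative jackknifing step is omitted in this sketch.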
In mathematical terms, PLS can be described as follows: the matrices X and Y of the independent and dependent variables are decomposed according to

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E} \qquad (8)$$

$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^{\mathrm{T}} + \mathbf{F} \qquad (9)$$

with E and F as residual matrices, T and U as score matrices, and P and Q as loading matrices.
For the decomposition, either a singular value decomposition (SVD) or the nonlinear iterative partial least squares (NIPALS) algorithm can be used [36]-[41].
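For PLS1 (one dependent variable), the NIPALS decomposition can be written down compactly in plain NumPy; the following is an illustrative variant of the algorithm, not the exact implementation cited above, and all names are hypothetical:

```python
# Illustrative NIPALS sketch for PLS1: builds the score matrix T and the
# loading matrix P of eq. (8) component by component, deflating X and y.
import numpy as np

def nipals_pls1(X, y, n_components):
    X, y = X.astype(float).copy(), y.astype(float).copy()
    n, p = X.shape
    T = np.zeros((n, n_components))   # scores
    P = np.zeros((p, n_components))   # X-loadings
    W = np.zeros((p, n_components))   # weights
    q = np.zeros(n_components)        # y-loadings
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)        # weight vector (maximum covariance)
        t = X @ w                     # score vector of component a
        tt = t @ t
        P[:, a] = X.T @ t / tt        # loading vector of component a
        q[a] = y @ t / tt
        X -= np.outer(t, P[:, a])     # deflate X -> residual E of eq. (8)
        y -= t * q[a]                 # deflate y -> residual F of eq. (9)
        T[:, a], W[:, a] = t, w
    return T, P, W, q
```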
A linear model is assumed to relate the score matrices T and U (with H as residual matrix and B as diagonal matrix):

$$\mathbf{U} = \mathbf{T}\mathbf{B} + \mathbf{H} \qquad (10)$$
The PLS1
algorithm models one variable y
at a time, whereas the PLS2 algorithm can model several variables in one run.
In this study, the PLS1 algorithm is used, as multiple PLS1 models often perform
better than a single PLS2 model [41],[42]. Further
details of the algorithms can be found in [39],[41],[43]-[45].
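The distinction between the two strategies can be sketched as follows, again assuming scikit-learn for illustration:

```python
# Sketch of PLS2 (all dependent variables in one run) versus the PLS1
# strategy (one model per dependent variable); data are hypothetical.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 30))
Y = rng.normal(size=(40, 3))   # three dependent variables

# PLS2: a single model for all three y-variables.
pls2 = PLSRegression(n_components=4).fit(X, Y)

# PLS1: a separate model per y-variable; the optimal number of components
# can then be chosen individually for each model.
pls1 = [PLSRegression(n_components=4).fit(X, Y[:, j]) for j in range(3)]
```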