10.2. Stopping Criteria for the Parallel Frameworks
In the second step of the genetic algorithm framework and of the parallel growing neural network framework, variables are added to the model until the prediction errors evaluated by a subsampling procedure do not improve any more. A difficult question is how the significance of this improvement should be judged (stopping criterion for the addition of variables). Many different approaches can be found in the literature, which can be classified into several categories such as significance tests or numerical comparisons, robust or non-robust tests, paired or non-paired tests, and local or global error minima.
In this study, 6 different methods were used to determine the optimal number of variables. First, the simple numerical mean prediction errors of the subsampled test data were compared before and after the addition of a variable, and the addition of variables was stopped at the first local minimum of the mean prediction error. A second approach calculates all mean prediction errors and uses the number of variables that corresponds to the global minimum of the prediction errors.
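The two numerical criteria can be sketched as follows. The short Python listing below is only an illustration with assumed names (the list mean_errors is supposed to hold the mean subsampled prediction errors after the addition of 1, 2, 3, ... variables) and does not reproduce the original implementation of the frameworks.

import numpy as np

def stop_at_first_local_minimum(mean_errors):
    # return the number of variables at the first local minimum of the
    # mean subsampled prediction errors
    for k in range(1, len(mean_errors)):
        if mean_errors[k] >= mean_errors[k - 1]:
            return k          # the model with k variables is the first local minimum
    return len(mean_errors)   # the errors improved up to the last variable

def stop_at_global_minimum(mean_errors):
    # return the number of variables with the overall smallest mean error
    # (requires the evaluation of the complete range of variable additions)
    return int(np.argmin(mean_errors)) + 1

# mean prediction errors after the addition of 1, 2, 3, ... variables
mean_errors = [0.52, 0.31, 0.27, 0.29, 0.25, 0.26]
print(stop_at_first_local_minimum(mean_errors))  # 3
print(stop_at_global_minimum(mean_errors))       # 5

For the example errors given in the listing, the first local minimum criterion stops at 3 variables, whereas the global minimum criterion selects 5 variables and requires the evaluation of all variable additions.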
Commonly used methods to judge the improvement of predictions are based on statistical significance tests. An overview of the different tests can be found in [232]. The significance tests were implemented in the frameworks to stop the addition of variables when the test indicates that the improvement of the predictions after the addition is not significant (see also figure 44 and figure 53). The most popular statistical test to compare the predictions of the subsampled test data before and after the addition of a variable is the Student's t-test [254]. The t-test requires a normal distribution of the prediction errors (this can be checked by a Kolmogorov-Smirnov test [38]) and thus is sensitive to outliers. A robust option for comparing the predictions of the test data subsets is the Kruskal-Wallis ANOVA [262],[263], which corresponds to the Mann-Whitney U-test, as only two groups are compared. If the partitioning of the subsampling procedure is reproducible for each addition of a variable (this means that the same test subsets are predicted during each loop of the variable addition), paired significance tests can be used, such as the paired t-test for normally distributed prediction errors and the Wilcoxon signed rank test as its robust counterpart [103],[264],[265].
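As an illustration of how these tests can be applied to the prediction errors of the subsampled test data, the following sketch uses the functions of the scipy.stats module with two hypothetical error arrays. The names and numbers are assumptions for demonstration purposes, and the Kolmogorov-Smirnov check shown here is only an approximate normality test, as the distribution parameters are estimated from the data.

import numpy as np
from scipy import stats

# prediction errors of the subsampled test data before and after the addition
# of a candidate variable (hypothetical numbers for demonstration)
errors_before = np.array([0.31, 0.28, 0.35, 0.30, 0.33, 0.29, 0.32, 0.34])
errors_after  = np.array([0.27, 0.25, 0.31, 0.26, 0.30, 0.24, 0.28, 0.29])
alpha = 0.05  # significance level (5 % error probability)

# approximate normality check: Kolmogorov-Smirnov test of the standardized errors
z = (errors_before - errors_before.mean()) / errors_before.std()
print(stats.kstest(z, 'norm'))

# unpaired comparisons of the two groups of prediction errors
print(stats.ttest_ind(errors_before, errors_after))     # Student's t-test
print(stats.mannwhitneyu(errors_before, errors_after))  # Mann-Whitney U-test
print(stats.kruskal(errors_before, errors_after))       # Kruskal-Wallis ANOVA (two groups)

# paired comparisons (require identical subsampling partitions in both loops)
print(stats.ttest_rel(errors_before, errors_after))     # paired t-test
print(stats.wilcoxon(errors_before, errors_after))      # Wilcoxon signed rank test

# the addition of the variable would only be accepted if the improvement is
# significant, i.e. if the p-value of the chosen test is below alpha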
The different categories of significance tests have different requirements, which can be summarized as follows. In contrast to robust tests, t-tests require a normal distribution of the prediction errors and are thus sensitive to biases and outliers, whereas robust tests are less powerful in terms of detecting differences of the prediction abilities. The paired tests require the same partitioning of the data into calibration and test subsets for each loop of the variable addition step. In contrast to the criterion of the global minimum of the prediction errors, the significance tests need only as many variable addition loops as improvements of the prediction errors are observed.
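The requirement of a reproducible partitioning can, for example, be fulfilled by generating the subsampling splits once and reusing them in every loop of the variable addition. The following sketch is only an illustration of this idea; the KFold splitter of scikit-learn, the number of splits and the data set size are assumptions and not part of the original frameworks.

import numpy as np
from sklearn.model_selection import KFold

n_samples = 200  # hypothetical data set size
# generate the subsampling partitions once (reproducible via the random seed)
splits = list(KFold(n_splits=10, shuffle=True, random_state=42)
              .split(np.arange(n_samples)))

# reuse exactly the same (calibration, test) index pairs in every loop of the
# variable addition, so that the prediction errors of the subsets are paired
for calibration_index, test_index in splits:
    pass  # build the model on the calibration subset, predict the test subset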
The number of variables selected depends not only on the method but also on the significance level of the statistical test, which was set to an error probability of 5% for all tests. In principle, the different methods can be divided into 4 groups according to the number of variables selected. The t-test, the Kruskal-Wallis ANOVA and the Wilcoxon signed rank test were the most conservative in terms of selecting variables. These three tests selected the same small number of variables for all data sets under investigation in this work. The paired t-test selected somewhat more variables in most cases, followed by the criterion of the first local minimum of the prediction errors, whereas the method of the global minimum of the prediction errors generally corresponded to even more variables.
All these methods are based on the prediction errors of the subsampled test data and not on an external data set. The question is how the prediction errors of the subsampled data correspond with the prediction errors of external validation data. The answer can be found in the so-called biasing [11], which means that when the same data are used for the model building process and for the variable selection process, the variable selection is biased towards selecting too many variables, whereby the bias increases with a decreasing number of samples. As the subsampled data are used several times for both processes, the optimal method depends on the sample size. For large data sets, the global minimum of the prediction errors of the subsampled test data corresponds with the smallest prediction errors of external validation data, whereas for smaller data sets more conservative methods correspond with the best errors of the validation data. This effect could be observed for all data sets under investigation. For the rather large refrigerant data set (441 samples for the calibration of only 2 analytes), the optimal method was the first local minimum criterion, whereas for the small quaternary mixtures (256 samples for the calibration of 4 analytes) and the ternary mixtures (245 samples for 3 analytes) the optimal method was the Kruskal-Wallis ANOVA.
Although the selection of the stopping criterion influences the prediction ability of the frameworks, an investigation using all data sets of this work showed that this selection is less critical than supposed at first glance. Among all data sets, the highest difference of the prediction errors of external validation data was 0.4% when different stopping criteria were used for the calibration data. The general recommendation of measuring as many samples as possible renders a sophisticated choice of the stopping criterion rather unnecessary, as for data sets that are not too small the local or global minimum criteria are adequate.