Datasets in typical quantitative structure-activity/property relationship (QSAR/QSPR)
and classification studies can range from a few hundred to several thousand samples. However,
the size of the datasets and the train/test split ratios can greatly affect the outcome
of the models, and thus the classification performance itself. We compared several
combinations of dataset sizes and split ratios with five different machine learning
algorithms to identify differences and similarities and to select the best parameter
settings for nonbinary (multiclass) classification. It is also known that the models
are ranked differently depending on the performance metric(s) used. Here, 25 performance
parameters were calculated for each model, and factorial ANOVA was then applied to compare
the results; both the comparison grid and the ANOVA step are sketched after this abstract.
The results clearly show differences not only between the applied machine learning
algorithms but also between the dataset sizes and, to a lesser extent, the train/test
split ratios. The XGBoost algorithm outperformed the others, even in multiclass modeling.
The performance parameters responded differently to changes in sample set size; some were
considerably more sensitive to this factor than others. Moreover, significant differences
were detected between the train/test split ratios as well, exerting a great effect on the
test-set validation of our models.
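
The Python sketch below illustrates the kind of comparison grid described above. It is a
minimal reconstruction under stated assumptions: the dataset sizes, split ratios, and four
of the five algorithms (only XGBoost is named in the abstract) are illustrative placeholders,
the synthetic data merely stands in for a QSAR/QSPR descriptor table, and only a handful of
the 25 performance parameters are computed.

```python
# Hypothetical comparison grid: sizes, ratios, and all models except XGBoost
# are assumptions for illustration, not the settings used in the study.
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, f1_score, matthews_corrcoef)
from xgboost import XGBClassifier

SIZES = [100, 500, 1000]   # assumed dataset sizes
RATIOS = [0.5, 0.67, 0.8]  # assumed train fractions
MODELS = {                 # five algorithms; only XGBoost is named in the abstract
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="mlogloss"),
    "RandomForest": RandomForestClassifier(n_estimators=200),
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "SVM": SVC(),
    "kNN": KNeighborsClassifier(),
}

rows = []
for seed in range(3):  # replicates per factor cell (needed later for ANOVA)
    for size in SIZES:
        # Synthetic 3-class data stands in for a QSAR/QSPR descriptor table.
        X, y = make_classification(n_samples=size, n_features=20,
                                   n_informative=8, n_classes=3,
                                   random_state=seed)
        for ratio in RATIOS:
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=ratio, stratify=y, random_state=seed)
            for name, proto in MODELS.items():
                pred = clone(proto).fit(X_tr, y_tr).predict(X_te)
                # A few multiclass-ready performance parameters (of the 25).
                rows.append({
                    "algorithm": name, "size": size, "ratio": ratio,
                    "accuracy": accuracy_score(y_te, pred),
                    "balanced_accuracy": balanced_accuracy_score(y_te, pred),
                    "macro_f1": f1_score(y_te, pred, average="macro"),
                    "mcc": matthews_corrcoef(y_te, pred),
                    "kappa": cohen_kappa_score(y_te, pred),
                })

results = pd.DataFrame(rows)
```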
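
A minimal sketch of the factorial ANOVA step follows, assuming the `results` table built
above; using MCC as the response variable is an illustrative choice, not the study's.

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Full-factorial model: algorithm x dataset size x split ratio as crossed
# categorical factors; the seed replicates above supply the residual df.
lm = ols("mcc ~ C(algorithm) * C(size) * C(ratio)", data=results).fit()
print(sm.stats.anova_lm(lm, typ=2))
```

The ANOVA table then shows which factors (and interactions) account for significant
variation in the chosen performance parameter, mirroring the comparison reported above.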