Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification

Rácz, Anita [Rácz, Anita (chemistry, chemometrics...), author] Institute of Materials and Environmental Chemistry (TTK); Plasma Chemistry Research Group (HRN TTK / AKI); Bajusz, Dávid [Bajusz, Dávid (medicinal chemistry, m...), author] Medicinal Chemistry Research Group (HRN TTK / SZKI); Héberger, Károly ✉ [Héberger, Károly (chemometrics, többv...), author] Institute of Materials and Environmental Chemistry (TTK); Plasma Chemistry Research Group (HRN TTK / AKI)

English-language research article (journal article), scientific
Published: MOLECULES 1431-5157 1420-3049 26 (4) Paper: 1111, 17 p. 2021
  • SJR Scopus - Pharmaceutical Science: Q1
Identifiers
Subject areas:
  • Theoretical and computational chemistry
  • Medicinal chemistry
  • Chemical sciences
Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that the models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and to a lesser extent the train/test split ratios. The XGBoost algorithm could outperform the others, even in multiclass modeling. The performance parameters reacted differently to the change of the sample set size; some of them were much more sensitive to this factor than the others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.
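The experimental design summarized in the abstract (a grid of dataset sizes crossed with train/test split ratios, each cell scored on held-out data) can be sketched roughly as follows. This is a minimal illustration only, not the authors' pipeline: the data are synthetic, the nearest-centroid classifier is a stand-in for the five algorithms compared in the paper, accuracy stands in for the 25 performance parameters, and the sizes and ratios in `run_grid` are hypothetical values chosen for the example. All function names are assumptions introduced here.

```python
import random
from collections import defaultdict


def make_dataset(n_samples, n_classes=3, seed=0):
    """Synthetic multiclass data: one noisy 2-D cluster per class (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n_samples):
        label = i % n_classes
        # Class centres sit 2 units apart on each axis; noise has unit standard deviation.
        x = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        data.append((x, label))
    return data


def stratified_split(data, train_fraction, seed=0):
    """Split class by class so train and test keep similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in data:
        by_class[label].append((x, label))
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


def fit_centroids(train):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x0, x1), label in train:
        s = sums[label]
        s[0] += x0
        s[1] += x1
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(centroids, x):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: (x[0] - centroids[c][0]) ** 2 + (x[1] - centroids[c][1]) ** 2,
    )


def accuracy(centroids, test):
    """Fraction of held-out samples classified correctly."""
    hits = sum(predict(centroids, x) == label for x, label in test)
    return hits / len(test)


def run_grid(sizes=(100, 500, 1000), train_fractions=(0.5, 0.6, 0.7, 0.8)):
    """Score every (dataset size, train fraction) cell; the values are hypothetical."""
    results = {}
    for n in sizes:
        data = make_dataset(n)
        for frac in train_fractions:
            train, test = stratified_split(data, frac)
            results[(n, frac)] = accuracy(fit_centroids(train), test)
    return results


if __name__ == "__main__":
    for (n, frac), acc in sorted(run_grid().items()):
        print(f"n={n:5d}  train={frac:.0%}  test accuracy={acc:.3f}")
```

In a fuller study, each cell would be repeated over several random seeds and classifiers, and the resulting table of scores would then be the input to a factorial ANOVA with algorithm, dataset size, and split ratio as factors.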