TY  - JOUR
AU  - Ipkovich, Ádám
AU  - Héberger, Károly
AU  - Sebestyén, Viktor
AU  - Abonyi, János
TI  - Utility function-based generalization of sum of ranking differences–country-wise analysis of greenhouse gas emissions
JF  - ECOLOGICAL INDICATORS
J2  - ECOL INDIC
VL  - 160
PY  - 2024
PG  - 18
SN  - 1470-160X
DO  - 10.1016/j.ecolind.2024.111734
UR  - https://m2.mtmt.hu/api/publication/34679063
ID  - 34679063
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Héberger, Károly
TI  - Frequent Errors in Modeling by Machine Learning: A Prototype Case of Predicting the Timely Evolution of COVID-19 Pandemic
JF  - ALGORITHMS
J2  - ALGORITHMS
VL  - 17
PY  - 2024
IS  - 1
PG  - 23
SN  - 1999-4893
DO  - 10.3390/a17010043
UR  - https://m2.mtmt.hu/api/publication/34517630
ID  - 34517630
AB  - Background: The development and application of machine learning (ML) methods have become so fast that almost nobody can follow their developments in every detail. It is no wonder that numerous errors and inconsistencies in their usage have also spread with a similar speed independently from the tasks: regression and classification. This work summarizes frequent errors committed by certain authors with the aim of helping scientists to avoid them. Methods: The principle of parsimony governs the train of thought. Fair method comparison can be completed with multicriteria decision-making techniques, preferably by the sum of ranking differences (SRD). Its coupling with analysis of variance (ANOVA) decomposes the effects of several factors. Earlier findings are summarized in a review-like manner: the abuse of the correlation coefficient and proper practices for model discrimination are also outlined. Results: Using an illustrative example, the correct practice and the methodology are summarized as guidelines for model discrimination, and for minimizing the prediction errors. The following factors are all prerequisites for successful modeling: proper data preprocessing, statistical tests, suitable performance parameters, appropriate degrees of freedom, fair comparison of models, and outlier detection, just to name a few. A checklist is provided in a tutorial manner on how to present ML modeling properly. The advocated practices are reviewed briefly in the discussion. Conclusions: Many of the errors can easily be filtered out with careful reviewing. Every author's responsibility is to adhere to the rules of modeling and validation. A representative sampling of recent literature outlines correct practices and emphasizes that no error-free publication exists.
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Héberger, Károly
TI  - Selection of optimal validation methods for quantitative structure–activity relationships and applicability domain
JF  - SAR AND QSAR IN ENVIRONMENTAL RESEARCH
J2  - SAR QSAR ENVIRON RES
VL  - 34
PY  - 2023
IS  - 5
SP  - 415
EP  - 434
PG  - 20
SN  - 1062-936X
DO  - 10.1080/1062936X.2023.2214871
UR  - https://m2.mtmt.hu/api/publication/33897785
ID  - 33897785
AB  - This brief literature survey groups the (numerical) validation methods and emphasizes the contradictions and confusion considering bias, variance and predictive performance. A multicriteria decision-making analysis has been made using the sum of absolute ranking differences (SRD), illustrated with five case studies (seven examples). SRD was applied to compare external and cross-validation techniques, indicators of predictive performance, and to select optimal methods to determine the applicability domain (AD). The ordering of model validation methods was in accordance with the statements of the original authors, but these contradict one another, suggesting that any variant of cross-validation can be superior or inferior to other variants depending on the algorithm, data structure and circumstances applied. A simple fivefold cross-validation proved to be superior to the Bayesian Information Criterion in the vast majority of situations. It is simply not sufficient to test a numerical validation method in one situation only, even if it is a well-defined one. SRD as a preferable multicriteria decision-making algorithm is suitable for tailoring the techniques for validation, and for the optimal determination of the applicability domain according to the dataset in question.
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Abonyi, János
AU  - Ipkovich, Ádám
AU  - Dörgő, Gyula
AU  - Héberger, Károly
TI  - Matrix factorization-based multi-objective ranking–What makes a good university?
JF  - PLOS ONE
J2  - PLOS ONE
VL  - 18
PY  - 2023
IS  - 4
PG  - 30
SN  - 1932-6203
DO  - 10.1371/journal.pone.0284078
UR  - https://m2.mtmt.hu/api/publication/33753835
ID  - 33753835
AB  - Non-negative matrix factorization (NMF) efficiently reduces high dimensionality for many-objective ranking problems. In multi-objective optimization, as long as only three or four conflicting viewpoints are present, an optimal solution can be determined by finding the Pareto front. When the number of the objectives increases, the multi-objective problem evolves into a many-objective optimization task, where the Pareto front becomes oversaturated. The key idea is that NMF aggregates the objectives so that the Pareto front can be applied, while the Sum of Ranking Differences (SRD) method selects the objectives that have a detrimental effect on the aggregation, and validates the findings. The applicability of the method is illustrated by the ranking of 1176 universities based on 46 variables of the CWTS Leiden Ranking 2020 database. The performance of NMF is compared to principal component analysis (PCA) and sparse non-negative matrix factorization-based solutions. The results illustrate that PCA incorporates negatively correlated objectives into the same principal component. On the contrary, NMF only allows non-negative correlations, which enable the proper use of the Pareto front. With the combination of NMF and SRD, a non-biased ranking of the universities based on 46 criteria is established, where Harvard, Rockefeller and Stanford Universities are determined as the first three. To evaluate the ranking capabilities of the methods, measures based on Relative Entropy (RE) and Hypervolume (HV) are proposed. The results confirm that the sparse NMF method provides the most informative ranking. The results highlight that academic excellence can be improved by decreasing the proportion of unknown open-access publications and short distance collaborations. The proportion of gender indicators barely correlates with scientific impact. More authors, long-distance collaborations, publications that have more scientific impact and citations on average highly influence the university ranking in a positive direction.
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Rácz, Anita
AU  - Mihalovits, Levente Márk
AU  - Bajusz, Dávid
AU  - Héberger, Károly
AU  - Miranda-Quintana, Ramón Alain
TI  - Molecular Dynamics Simulations and Diversity Selection by Extended Continuous Similarity Indices
JF  - JOURNAL OF CHEMICAL INFORMATION AND MODELING
J2  - J CHEM INF MODEL
VL  - 62
PY  - 2022
IS  - 14
SP  - 3415
EP  - 3425
PG  - 11
SN  - 1549-9596
DO  - 10.1021/acs.jcim.2c00433
UR  - https://m2.mtmt.hu/api/publication/32992489
ID  - 32992489
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Gere, Attila
AU  - Szakál, Dorina
AU  - Héberger, Károly
TI  - Multiobject Optimization of National Football League Drafts: Comparison of Teams and Experts
JF  - APPLIED SCIENCES-BASEL
J2  - APPL SCI-BASEL
VL  - 12
PY  - 2022
IS  - 13
PG  - 14
SN  - 2076-3417
DO  - 10.3390/app12136303
UR  - https://m2.mtmt.hu/api/publication/32898327
ID  - 32898327
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Orosz, Álmos
AU  - Héberger, Károly
AU  - Rácz, Anita
TI  - Comparison of Descriptor- and Fingerprint Sets in Machine Learning Models for ADME-Tox Targets
JF  - FRONTIERS IN CHEMISTRY
J2  - FRONT CHEM
VL  - 10
PY  - 2022
PG  - 15
SN  - 2296-2646
DO  - 10.3389/fchem.2022.852893
UR  - https://m2.mtmt.hu/api/publication/32867980
ID  - 32867980
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Rácz, Anita
AU  - Dunn, Timothy B.
AU  - Bajusz, Dávid
AU  - Kim, Taewon D.
AU  - Miranda-Quintana, Ramón Alain
AU  - Héberger, Károly
TI  - Extended continuous similarity indices: theory and application for QSAR descriptor selection
JF  - JOURNAL OF COMPUTER-AIDED MOLECULAR DESIGN
J2  - J COMPUT AID MOL DES
VL  - 36
PY  - 2022
IS  - 3
SP  - 157
EP  - 173
PG  - 17
SN  - 0920-654X
DO  - 10.1007/s10822-022-00444-7
UR  - https://m2.mtmt.hu/api/publication/32741735
ID  - 32741735
LA  - English
DB  - MTMT
ER  - 

TY  - JOUR
AU  - Abonyi, János
AU  - Czvetkó, Tímea
AU  - Kosztyán, Zsolt Tibor
AU  - Héberger, Károly
TI  - Factor analysis, sparse PCA, and Sum of Ranking Differences-based improvements of the Promethee-GAIA multicriteria decision support technique
JF  - PLOS ONE
J2  - PLOS ONE
VL  - 17
PY  - 2022
IS  - 2
PG  - 25
SN  - 1932-6203
DO  - 10.1371/journal.pone.0264277
UR  - https://m2.mtmt.hu/api/publication/32710274
ID  - 32710274
N1  - Published: February 2022
LA  - English
DB  - MTMT
ER  - 

TY  - GEN
AU  - Sziklai, Balázs
AU  - Baranyi, Máté
AU  - Héberger, Károly
TI  - Testing Rankings with Cross-Validation
PY  - 2021
PG  - 27
UR  - https://m2.mtmt.hu/api/publication/32732679
ID  - 32732679
LA  - English
DB  - MTMT
ER  - 