The Activity Prediction Problem with Imbalanced Datasets: A Comparative Study

Abstract: Many problems in chemistry involve large databases, which often yield highly imbalanced datasets in which the majority class far outnumbers the minority class. Building strong classification models on such data is challenging because traditional machine learning algorithms tend to focus on the majority class: maximizing overall accuracy may result in poor prediction of the minority class. This work presents a comparative study of forty-three class-imbalance algorithms applied to quantitative structure-activity relationship (QSAR) modeling on imbalanced datasets. We used statistical tests specifically designed to compare multiple algorithms on multiple datasets to evaluate the performance of the solutions and determine the best algorithms.

Manuel Mendoza-Hurtado, Nicolás García-Pedrajas, Ramón Carrasco Velar,
and Gonzalo Cerruela-García, submitted.

Source code: