Partial random under/over-sampling for multi-label problems

Abstract: Many current data mining applications involve instances that belong to more than one class, a task known as multi-label classification. Exploiting the correlations among labels can yield better performance than methods that handle each label separately. One of the major challenges in multi-label datasets is class imbalance: in most cases, several or many of the labels are sparsely populated, producing heavily imbalanced datasets. Standard methods for single-label class-imbalanced datasets are not easily applicable because the multi-label case lacks a proper concept of a minority instance. In this paper we propose a new approach, based on partial under-sampling and/or over-sampling of the instances, that is better suited to the multi-label case. The method adapts under-sampling and over-sampling to work on a per-label basis, so that an instance is under-sampled or over-sampled only partially. On a large set of 55 real-world multi-label problems, our approach improves on the results of current methods for dealing with class imbalance in multi-label problems.
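To make the per-label idea concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes one plausible reading of "partial" under-sampling: each label gets its own training subset, so an instance can be dropped for one label while being kept for another. The function name `per_label_undersample`, the `ratio` parameter, and the toy label matrix `Y` are all hypothetical, and the sketch assumes negatives are the majority for each label.

```python
import random

def per_label_undersample(Y, label, ratio=1.0, seed=0):
    """Return the indices kept when training a model for a single label.

    Hypothetical helper, not taken from the paper. Y is a list of rows,
    each a list of 0/1 label indicators. All positives for `label` are
    kept; negatives are randomly under-sampled to at most
    ratio * (number of positives).
    """
    rng = random.Random(seed)
    pos = [i for i, row in enumerate(Y) if row[label] == 1]
    neg = [i for i, row in enumerate(Y) if row[label] == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)

# Toy label matrix (assumed data): 6 instances, 3 sparsely populated labels.
Y = [
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
]

# Each label gets its own subset, so no instance is removed globally.
kept_per_label = {j: per_label_undersample(Y, j) for j in range(3)}
```

Under this reading, an instance that is a majority example for one label and a minority example for another is never discarded outright, which sidesteps the missing notion of a single "minority instance" in the multi-label setting.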

N. García-Pedrajas

Detailed results and Figures: