Imbalanced data learning approach with class overlap degree as the optimization goal

SUN Bo; ZHOU Qian; CHEN Hai-Yan

Cite this article: SUN Bo, ZHOU Qian, CHEN Hai-Yan. Imbalanced data learning approach with class overlap degree as the optimization goal[J]. Control Theory & Applications, 2024, 41(11): 2139-2146.


Received: 2022-02-20; Revised: 2024-08-11

DOI: 10.7641/CTA.2023.20123

2024,41(11):2139-2146

Keywords: classification; class imbalance; undersampling; class overlap degree; data complexity; machine learning

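Funding: Supported by the Natural Science Foundation of Shandong Province (ZR2023MF098, ZR2018QF002) and the Major Scientific and Technological Innovation Project of Shandong Province (2019JZZY010706).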

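Author	Affiliation	E-mail
SUN Bo^*	Shandong Agricultural University	sunbo87@126.com
ZHOU Qian	Shandong Agricultural University
CHEN Hai-Yan	Nanjing University of Aeronautics and Astronautics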

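Abstract (translated from the Chinese)

Classification is an important learning task in machine learning; its basic idea is to use a classifier built on a set of training examples to predict the class of test examples. However, the training sets in many practical applications have imbalanced class distributions, which usually limits the classification performance of learning algorithms. To address this, this paper proposes an imbalanced data learning approach with class overlap degree as the optimization goal (COA-RBU). It uses the mutual class potential as the criterion for evaluating the utility of majority class examples, and adaptively determines a proper undersampling ratio according to the class overlap degree of the training set, so as to reduce the data complexity of the imbalanced training set. Experimental results show that the class overlap degree reflects the learning difficulty of a dataset well, and that COA-RBU achieves good performance and high efficiency. This work therefore offers a new way to determine a proper undersampling ratio from the perspective of class-overlap data complexity.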

English abstract

Classification is an important learning task in machine learning: it predicts the class label of a test example using a classifier learned on a set of training examples. However, in many practical applications the collected training sets have imbalanced class distributions, which usually hinders the classification performance of most classifier learning algorithms. To alleviate this problem, this paper proposes an imbalanced data learning approach with class overlap degree as the optimization goal (COA-RBU). It uses the mutual class potential to evaluate the utility of each majority class example, and adaptively determines a proper undersampling ratio according to the class overlap degree of the training set, aiming to decrease the data complexity of the imbalanced training set. Experimental results indicate that the class overlap degree reflects the learning difficulty of an imbalanced dataset well, and that COA-RBU is both effective and efficient. This work therefore provides a novel idea for determining a proper undersampling ratio from the perspective of class-overlap data complexity.
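The abstract does not give formulas, so the following is only a rough sketch of the general idea, not the authors' COA-RBU algorithm. It makes three assumptions: the mutual class potential is a Gaussian radial basis sum over majority examples minus the same sum over minority examples; the class overlap degree is approximated by a stand-in measure (the leave-one-out 1-nearest-neighbour error rate); and majority examples with the lowest potential are dropped first. The function names and the `gamma` and `step` parameters are hypothetical.

```python
import numpy as np


def mutual_class_potential(points, X_maj, X_min, gamma=1.0):
    """Assumed form: Gaussian RBF sum over majority minus sum over minority."""
    def rbf_sum(A, B):
        # Pairwise squared Euclidean distances between rows of A and rows of B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / gamma ** 2).sum(axis=1)
    return rbf_sum(points, X_maj) - rbf_sum(points, X_min)


def overlap_degree(X, y):
    """Stand-in overlap measure (an assumption): leave-one-out 1-NN error rate."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point may not be its own neighbour
    nearest = d2.argmin(axis=1)
    return float(np.mean(y[nearest] != y))


def adaptive_undersample(X, y, maj=0, gamma=1.0, step=1):
    """Try increasing removal counts; keep the subset with the lowest overlap.

    Returns the indices of the retained examples and their overlap value.
    """
    all_idx = np.arange(len(y))
    maj_idx = np.flatnonzero(y == maj)
    min_idx = np.flatnonzero(y != maj)
    pot = mutual_class_potential(X[maj_idx], X[maj_idx], X[min_idx], gamma)
    # Assumed removal order: lowest potential (minority-dominated region) first.
    order = maj_idx[np.argsort(pot)]
    best_keep, best_overlap = all_idx, overlap_degree(X, y)
    # Never remove more than is needed to balance the two classes.
    max_removed = len(maj_idx) - len(min_idx)
    for removed in range(step, max_removed + 1, step):
        keep = np.setdiff1d(all_idx, order[:removed])
        ov = overlap_degree(X[keep], y[keep])
        if ov < best_overlap:  # strict improvement keeps the smallest removal
            best_keep, best_overlap = keep, ov
    return best_keep, best_overlap
```

Calling `adaptive_undersample(X, y)` on a feature matrix `X` and a label vector `y` (majority label `0` by default) returns the retained row indices, so the undersampling ratio falls out of the search rather than being fixed in advance. Swapping in a different overlap measure only requires replacing `overlap_degree`.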