Virtual sample generation based on reduced-feature probability density distribution

TANG Jian; CUI Can-lin; WANG Dan-dan; QIAO Jun-fei

Cite this article:	TANG Jian, CUI Can-lin, WANG Dan-dan, QIAO Jun-fei. Virtual sample generation method using reduced feature probability density distribution[J]. Control Theory & Applications (控制理论与应用), 2024, 41(11): 2165-2173.

Virtual sample generation method using reduced feature probability density distribution

Received: 2022-03-24; Revised: 2023-04-18

DOI: 10.7641/CTA.2023.20210

2024,41(11):2165-2173

Keywords: virtual sample generation; principal component analysis; probability density distribution; kernel density estimation; comprehensive learning particle swarm; mixed modeling samples

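Funding: Supported by the National Natural Science Foundation of China (62073006, 62021003) and the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0112301, 2021ZD0112302).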

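Author	Affiliation	E-mail
TANG Jian^*	Faculty of Information Technology, Beijing University of Technology	tjian001@126.com
CUI Can-lin	Faculty of Information Technology, Beijing University of Technology
WANG Dan-dan	Faculty of Information Technology, Beijing University of Technology
QIAO Jun-fei	Faculty of Information Technology, Beijing University of Technology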

Abstract

Modeling data for difficult-to-measure parameters of complex industrial processes, such as product quality and environmental indicators, are typically small in sample size and sparsely distributed. To expand such modeling data, this paper proposes a virtual sample generation (VSG) method based on the probability density distribution (PDF) of reduced features. First, principal component analysis (PCA) is used to reduce the features of the small-sample data, and kernel density estimation (KDE) is applied to the resulting independent principal components to generate candidate virtual principal components; after orthogonal sampling, these are reconstructed to obtain the virtual sample inputs. Next, to balance the accuracy and randomness of the mapping model, an ensemble mapping model built from a random forest (RF) and a random weight neural network (RWNN) is used to obtain the virtual sample outputs. Finally, the parameters that affect the quality of the virtual samples, namely the principal component contribution rate, the KDE smoothing index, the number of candidate virtual principal components, the number of virtual samples, the learning parameters of the mapping models, and the ensemble weights, are optimized with the comprehensive learning particle swarm optimization (CLPSO) algorithm to obtain the optimal virtual samples. Experiments on a benchmark dataset and on dioxin (DXN) data from a municipal solid waste incineration process verify the rationality and effectiveness of the proposed method.
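
The input-generation step described in the abstract (PCA reduction, KDE on each principal component, sampling, reconstruction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `generate_virtual_inputs` and the parameters `var_ratio` and `smooth` are hypothetical; the bandwidth uses the normal-reference rule of thumb scaled by `smooth` as a stand-in for the paper's KDE smoothing index; and plain independent random draws per component replace the orthogonal sampling used in the paper. The RF/RWNN output mapping and the CLPSO tuning are not covered.

```python
import numpy as np


def generate_virtual_inputs(X, n_virtual, var_ratio=0.95, smooth=1.0, seed=0):
    """Sketch: PCA -> per-component Gaussian KDE sampling -> reconstruction."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # PCA via SVD of the centered data.
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)

    # Keep the fewest components whose cumulative contribution reaches var_ratio.
    k = int(np.searchsorted(np.cumsum(explained), var_ratio)) + 1
    k = min(k, len(s))
    components = Vt[:k]            # shape (k, n_features)
    scores = Xc @ components.T     # shape (n, k)

    # Assumed bandwidth: normal-reference rule of thumb, scaled by `smooth`.
    std = scores.std(axis=0, ddof=1)
    h = smooth * 1.06 * std * n ** (-1 / 5)

    # Drawing from a Gaussian KDE: pick an observed score, add N(0, h^2) noise.
    # Each component is sampled independently (assumption; the paper uses
    # orthogonal sampling of candidate virtual principal components).
    idx = rng.integers(0, n, size=(n_virtual, k))
    noise = rng.normal(size=(n_virtual, k)) * h
    virtual_scores = scores[idx, np.arange(k)] + noise

    # Reconstruct virtual inputs in the original feature space.
    return virtual_scores @ components + mean
```

For a small data matrix `X` of shape `(n_samples, n_features)`, `generate_virtual_inputs(X, n_virtual=100)` returns an array of shape `(100, n_features)`. Setting `smooth=0.0` disables the kernel noise, so each virtual score is an observed score; larger values spread the virtual samples further from the real ones.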