Cite this article: TANG Jian, CUI Can-lin, WANG Dan-dan, QIAO Jun-fei. Virtual sample generation method using reduced feature probability density distribution[J]. Control Theory & Applications, 2024, 41(11): 2165-2173.
Virtual sample generation method using reduced feature probability density distribution
Received: 2022-03-24  Revised: 2023-04-18
DOI  10.7641/CTA.2023.20210
  2024,41(11):2165-2173
Keywords  virtual sample generation  principal component analysis  probability density distribution  kernel density estimation  comprehensive learning particle swarm  mixed modeling sample
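Funding  Supported by the National Natural Science Foundation of China (62073006, 62021003) and the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2021ZD0112301, 2021ZD0112302).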
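Author  Affiliation  E-mail
TANG Jian*  Faculty of Information Technology, Beijing University of Technology  tjian001@126.com
CUI Can-lin  Faculty of Information Technology, Beijing University of Technology
WANG Dan-dan  Faculty of Information Technology, Beijing University of Technology
QIAO Jun-fei  Faculty of Information Technology, Beijing University of Technology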
Abstract
      Modeling data for difficult-to-measure parameters in complex industrial processes, such as product quality and environmental indicators, are characterized by small sample sizes and sparse distributions. To address this, a virtual sample generation (VSG) method based on the probability density distribution (PDF) of reduced features is proposed to augment the modeling data. First, principal component analysis (PCA) is used to reduce the features of the small-sample data, and kernel density estimation (KDE) is performed on the resulting independent principal components to generate candidate virtual principal components; these are then orthogonally sampled and reconstructed to obtain the virtual sample inputs. Next, to balance the accuracy and randomness of the mapping model, an ensemble mapping model built from a random forest (RF) and a random weight neural network (RWNN) is used to obtain the virtual sample outputs. Finally, the parameters that affect virtual sample quality, namely the principal component contribution rate, the KDE smoothing index, the numbers of candidate virtual principal components and virtual samples, the learning parameters of the mapping models, and the ensemble weights, are optimized with the comprehensive learning particle swarm optimization (CLPSO) algorithm to obtain the optimal virtual samples. Experiments on benchmark datasets and a dioxin (DXN) dataset from a municipal solid waste incineration process verify the rationality and effectiveness of the proposed method.
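The input-generation and output-mapping steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: all function names (`pca_fit`, `kde_sample`, `generate_virtual_inputs`, `rwnn_fit`, `rwnn_predict`) are hypothetical, the KDE uses a Gaussian kernel with a rule-of-thumb bandwidth scaled by a smoothing factor, the orthogonal sampling step is replaced by plain independent draws per component, and only the RWNN member of the mapping model is shown. The RF member, the ensemble weighting, and the CLPSO parameter search are omitted.

```python
import numpy as np

def pca_fit(X, contribution=0.9):
    """Fit PCA via SVD and keep the fewest components whose cumulative
    explained-variance ratio reaches `contribution`."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = np.cumsum(s**2) / np.sum(s**2)
    k = min(int(np.searchsorted(ratio, contribution)) + 1, len(s))
    return mean, Vt[:k]

def kde_sample(scores, n, smoothing, rng):
    """Draw n values from a 1-D Gaussian KDE fitted to `scores`.
    Sampling a Gaussian KDE = pick a data point, then add kernel noise.
    The bandwidth rule below is an assumption, not taken from the paper."""
    h = smoothing * 1.06 * scores.std(ddof=1) * len(scores) ** (-1 / 5)
    centers = rng.choice(scores, size=n, replace=True)
    return centers + rng.normal(0.0, h, size=n)

def generate_virtual_inputs(X, n_virtual, contribution=0.9, smoothing=1.0, seed=0):
    """PCA -> per-component KDE sampling -> reconstruction to input space."""
    rng = np.random.default_rng(seed)
    mean, components = pca_fit(X, contribution)
    scores = (X - mean) @ components.T  # shape (n_samples, k)
    virtual_scores = np.column_stack(
        [kde_sample(scores[:, j], n_virtual, smoothing, rng)
         for j in range(scores.shape[1])]
    )
    return virtual_scores @ components + mean

def rwnn_fit(X, y, n_hidden=30, ridge=1e-3, seed=0):
    """Random weight neural network: hidden weights are random and fixed,
    output weights are solved by ridge-regularized least squares."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def rwnn_predict(model, X):
    W, b, beta = model
    return np.tanh(X @ W + b) @ beta

# Synthetic small-sample data (made up purely for this demonstration).
demo_rng = np.random.default_rng(42)
X_small = demo_rng.normal(size=(20, 5))
y_small = X_small @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * demo_rng.normal(size=20)

X_virtual = generate_virtual_inputs(X_small, n_virtual=50)
y_virtual = rwnn_predict(rwnn_fit(X_small, y_small), X_virtual)
print(X_virtual.shape, y_virtual.shape)
```

In this sketch, `contribution` and `smoothing` stand in for the principal component contribution rate and KDE smoothing index named in the abstract, and `n_virtual` for the number of virtual samples; in the proposed method such parameters are tuned by CLPSO rather than fixed by hand.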