Action recognition with hierarchical convolutional neural networks features and bi-directional long short-term memory model

GE Rui; WANG Zhao-hui; XU Xin; JI Yi; LIU Chun-ping; GONG Sheng-rong

Cite this article: GE Rui, WANG Zhao-hui, XU Xin, JI Yi, LIU Chun-ping, GONG Sheng-rong. Action recognition with hierarchical convolutional neural networks features and bi-directional long short-term memory model[J]. Control Theory & Applications (控制理论与应用), 2017, 34(6): 790-796.

Received: 2016-08-12; Revised: 2017-05-26

DOI: 10.7641/CTA.2017.60607

2017,34(6):790-796

Keywords: action recognition; convolutional neural networks; recurrent neural networks; bi-directional recurrent neural networks

Funding: National Natural Science Foundation of China; Provincial Natural Science Foundation

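Author	Affiliation	E-mail
GE Rui	Soochow University	forgerui@163.com
WANG Zhao-hui	Soochow University
XU Xin	Soochow University
JI Yi	Soochow University
LIU Chun-ping	Soochow University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; Collaborative Innovation Center of Novel Software Technology and Industrialization
GONG Sheng-rong*	Changshu Institute of Technology; Soochow University	shrgong@suda.edu.cn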

Abstract

Robust action recognition in videos is a challenging task because of its complexity, and effectively extracting robust spatio-temporal features is the key to solving it. In this paper, we propose using a bi-directional long short-term memory (Bi-LSTM) model as the main framework to capture bi-directional spatio-temporal features of video sequences. First, to strengthen the feature representation, traditional hand-crafted descriptors are replaced by hierarchical convolutional neural network features. These multi-layer convolutional features fuse low-level shape information with high-level semantic information and thus capture rich spatial information. The extracted convolutional features are then fed into the Bi-LSTM, which contains two LSTM layers running in opposite directions: the forward layer captures the evolution of the video from beginning to end, and the backward layer models the evolution in the reverse direction. Finally, the representations from the two directions are fused in a Softmax layer to produce the classification result. Experiments on the UCF101 and HMDB51 datasets show that the method achieves performance comparable to state-of-the-art methods for action recognition.
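
The pipeline described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the weights are random, the per-frame "low-level" and "high-level" features are random stand-ins for real CNN activations, and both fusion steps (concatenating the two feature levels, and concatenating the final forward and backward hidden states before Softmax) are assumptions, since the abstract does not specify them. All function names and dimensions are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_layers(low_feat, high_feat):
    """Assumed fusion: concatenate low-level and high-level feature vectors."""
    return low_feat + high_feat


def init_lstm(input_dim, hidden_dim, rng):
    """Random weights for one LSTM layer; four gates act on [x; h]."""
    rows, cols = 4 * hidden_dim, input_dim + hidden_dim
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    b = [0.0] * rows
    return W, b


def lstm_last_state(frames, params, hidden_dim):
    """Run an LSTM over a list of feature vectors; return the final hidden state."""
    W, b = params
    H = hidden_dim
    h = [0.0] * H
    c = [0.0] * H
    for x in frames:
        xh = x + h  # concatenate input with previous hidden state
        z = [sum(w * v for w, v in zip(row, xh)) + bias
             for row, bias in zip(W, b)]
        i = [sigmoid(v) for v in z[0:H]]              # input gate
        f = [sigmoid(v) for v in z[H:2 * H]]          # forget gate
        o = [sigmoid(v) for v in z[2 * H:3 * H]]      # output gate
        g = [math.tanh(v) for v in z[3 * H:4 * H]]    # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilstm_classify(frames, fwd, bwd, W_out, hidden_dim):
    h_fwd = lstm_last_state(frames, fwd, hidden_dim)        # first -> last frame
    h_bwd = lstm_last_state(frames[::-1], bwd, hidden_dim)  # last -> first frame
    fused = h_fwd + h_bwd  # assumed fusion: concatenate both directions
    scores = [sum(w * v for w, v in zip(row, fused)) for row in W_out]
    return softmax(scores)


# Illustrative sizes only; 101 matches the number of classes in UCF101.
rng = random.Random(0)
T, LOW, HIGH, HID, CLASSES = 5, 3, 4, 6, 101
frames = [
    fuse_layers([rng.random() for _ in range(LOW)],
                [rng.random() for _ in range(HIGH)])
    for _ in range(T)
]
fwd = init_lstm(LOW + HIGH, HID, rng)
bwd = init_lstm(LOW + HIGH, HID, rng)
W_out = [[rng.uniform(-0.1, 0.1) for _ in range(2 * HID)] for _ in range(CLASSES)]
probs = bilstm_classify(frames, fwd, bwd, W_out, HID)
```

Here `probs` is a probability distribution over the action classes for one clip. In practice the frame features would come from several layers of a pretrained CNN and the weights would be learned by backpropagation rather than drawn at random.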