Abstract. This article reviews the recent literature and existing research on the topic and describes the tasks associated with recognizing and predicting human movements.
Review of existing studies. At the time of writing, there are many studies on the task of predicting movements in video using deep learning methods. In [1], a simple pipeline is proposed for classifying and localizing actions in untrimmed videos. The authors combine a 3D CNN (3D Convolutional Neural Network) and an RNN (Recurrent Neural Network) in a single network structure: features produced by the 3D CNN are fed as input to the RNN. The network processes the video as a sequence of 16-frame clips and returns a sequence of class probabilities. The model is trained on the ActivityNet Challenge 2016 dataset (640 hours of video) and achieves mAP 0.5874 in action classification and mAP 0.2237 in action localization.
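To make the architecture described above concrete, the following is a minimal PyTorch sketch of a 3D-CNN clip encoder feeding an RNN that outputs per-clip class probabilities. All layer sizes, module names (Clip3DCNN, ClipSequenceRNN) and input shapes are assumptions for illustration, not the implementation from [1].

```python
# Hypothetical sketch of a 3D-CNN -> RNN pipeline over 16-frame clips, in the spirit of [1].
import torch
import torch.nn as nn

class Clip3DCNN(nn.Module):
    """Encodes one 16-frame clip (C=3, T=16, H, W) into a fixed-length feature vector."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),            # -> (B, 128, 1, 1, 1)
        )
        self.proj = nn.Linear(128, feat_dim)

    def forward(self, clip):                    # clip: (B, 3, 16, H, W)
        x = self.features(clip).flatten(1)      # (B, 128)
        return self.proj(x)                     # (B, feat_dim)

class ClipSequenceRNN(nn.Module):
    """Runs an RNN over a sequence of clip features and returns per-clip class probabilities."""
    def __init__(self, feat_dim=512, hidden=256, num_classes=200):
        super().__init__()
        self.cnn = Clip3DCNN(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                   # clips: (B, N_clips, 3, 16, H, W)
        b, n = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, n, -1)   # (B, N, feat_dim)
        out, _ = self.rnn(feats)                               # (B, N, hidden)
        return self.head(out).softmax(dim=-1)                  # (B, N, num_classes)

# Usage with dummy data: probs = ClipSequenceRNN()(torch.randn(2, 8, 3, 16, 112, 112))
```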
In [2], the problem of accurately localizing action key frames in video is considered. The authors describe loss functions designed to reduce the number of false-positive predictions. The structured loss is based on the best match between predicted and labeled action onsets; the auxiliary losses described are Matching Loss, Wasserstein/EMD Loss, Per-Frame Loss, and Combined Loss (a sketch of an EMD-style onset loss is given after this overview). A recurrent neural network is trained to minimize the structured loss with gradient descent. The losses are evaluated on The Mouse Reach Dataset (each video begins with a mouse reaching for a food pellet and ends when it eats it) and THUMOS'14 (a large set of videos with annotated actions). On the Mouse Reach dataset, the Wasserstein/EMD Loss proved easier to optimize, while on THUMOS'14 the Matching Loss showed the best results.
In [3], the authors study action detection from streaming skeletal data. They propose a multi-task recurrent neural network, Joint Classification-Regression, for more accurate action recognition and localization. By jointly optimizing the classification and regression subtasks, the network automatically finds the start and end of actions more precisely. In particular, the deep Long Short-Term Memory (LSTM) subnetwork automatically captures complex long-term temporal dynamics, which avoids the typical sliding-window design and achieves high computational efficiency. In addition, the regression subtask makes it possible to predict an action before it occurs. The model is evaluated on the publicly available G3D (Gaming Action Dataset), on which the authors report promising performance.
The authors of [4] address the problem of recognizing sequences of human actions in a video stream. Their goal is to demonstrate the importance of detecting the starting point of an action and to propose a method for determining the start of the current action. The method is based on a bidirectional recurrent network (Bidirectional LSTM) that estimates the probability of a frame being a starting point by comparing the dynamics of the action before and after that frame. Experiments on three datasets (primarily the Montalbano Gesture dataset) showed that the method reliably detects the starting point of an action, improving the accuracy of early recognition.
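As referenced above, the following is a hedged sketch of a one-dimensional Wasserstein/EMD-style loss for onset localization, in the spirit of the losses discussed in [2]. The exact formulation in that paper may differ; here the labeled onset is treated as a one-hot target frame and the prediction as a per-frame probability distribution, compared through their cumulative distributions (in 1-D, the Wasserstein-1 distance equals the L1 distance between CDFs). The function name and shapes are assumptions.

```python
# Hypothetical 1-D EMD/Wasserstein onset loss: compares per-frame onset probabilities
# against a labeled onset frame via cumulative distributions.
import torch
import torch.nn.functional as F

def emd_onset_loss(frame_logits: torch.Tensor, onset_frame: torch.Tensor) -> torch.Tensor:
    """frame_logits: (B, T) per-frame scores; onset_frame: (B,) index of the labeled onset."""
    b, t = frame_logits.shape
    pred = F.softmax(frame_logits, dim=1)                     # predicted onset distribution
    target = F.one_hot(onset_frame, num_classes=t).float()    # labeled onset as one-hot
    # 1-D EMD = L1 distance between the two cumulative distributions.
    return (pred.cumsum(dim=1) - target.cumsum(dim=1)).abs().sum(dim=1).mean()

# Usage with dummy data: loss = emd_onset_loss(torch.randn(4, 100), torch.randint(0, 100, (4,)))
```

A loss of this form is differentiable in the frame scores and penalizes predictions in proportion to how far they sit from the labeled onset, which is consistent with the paper's motivation of optimizing localization by gradient descent.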