Abstract. This article reviews the recent literature and existing research on the topic and describes the tasks associated with recognizing and predicting human movements.
The task of recognizing and predicting human movements
Action recognition in video remains a challenging task due to large intra-class variation, unclear boundaries between classes, viewpoint changes, occlusion, appearance, environmental influences, and recording quality, particularly in realistic videos. Moreover, a complete system for recognizing human actions must also draw on other disciplines, such as psychology.

Human pose recognition is a computer vision technique based on detecting and analyzing a person's position in space. The classic approach to effective human pose estimation is to use a framework that visually models the human body and its movements, representing it as a set of components. The three most commonly used approaches are the skeleton-based model, the contour-based model, and the volume-based model [5]. The skeletal model is based on a set of coordinates of key points such as the ankles, knees, shoulders, elbows, wrists, and other joints that together make up the skeletal structure of the human body. Owing to its versatility and flexibility, this model is used in both 2D and 3D pose recognition methods. 2D estimation detects and analyzes the (x, y) coordinates of the body joints in an RGB image; 3D estimation adds a z coordinate. A 2D model is used in this work. The contour model was used extensively in the past; it represented parts of the torso as rectangles of known length and width. The simplest volume-oriented models were similar to contour ones, except that they used three-dimensional primitives (cylinders, cones, and so on) instead of two-dimensional figures. Modern volumetric models usually take the form of a mesh obtained with 3D scanning methods. Technological advances in computer science and engineering now enable computer systems to understand human actions depicted in video.
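To illustrate the skeleton-based 2D model described above, the following minimal sketch represents a pose as (x, y) keypoint coordinates and derives a simple feature (a joint angle) from them. The joint names, coordinate values, and the `joint_angle` helper are invented for illustration and are not part of any particular framework's API.

```python
import math

# Hypothetical 2D skeleton: joint name -> (x, y) pixel coordinates,
# as a skeleton-based model would represent one detected pose.
pose_2d = {
    "shoulder": (120.0, 80.0),
    "elbow":    (140.0, 130.0),
    "wrist":    (150.0, 180.0),
}

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    return math.degrees(math.acos(dot / (n1 * n2)))

# A nearly straight arm yields an elbow angle close to 180 degrees.
elbow_angle = joint_angle(pose_2d["shoulder"], pose_2d["elbow"], pose_2d["wrist"])
```

Angles and distances computed from such keypoints are typical view-robust features fed to downstream classifiers; for 3D estimation the same idea extends to (x, y, z) triples.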
In the field of computer vision, there are two main problems associated with this: recognizing and predicting human actions from acquired video data. While action recognition requires data containing the complete process of an action, action prediction must infer an action from incomplete video or sensor data [6]. Activities are often typical indoor ones such as walking, talking, standing, and sitting; they can also be more specialized, such as those performed in a kitchen or a factory. The video dataset used in this preprint includes activities that might be part of work in an automobile manufacturing plant: lifting and lowering objects (correctly and incorrectly), working in an inclined position, overhead, above a table, and so on.

Action prediction is the result of classifying incomplete input data into an action yet to come. One subtask is action anticipation: recognizing an action before any fragment of it has been observed, with classification based entirely on observed contextual clues. The other is early action prediction, based on the observed part of the action. Both are classification problems, but prediction often requires a temporally annotated dataset and a clear division between a "before action" segment and a "during action" segment for action anticipation, or between "initial action" and "final action" segments for early prediction. Temporal activity prediction is the process of dividing the input video into segments (sequential series of frames) of action and inaction by marking the beginning and end of each action fragment. Temporal action localization/detection is the process of hypothesizing when an action occurs and classifying each action. In our work, the entire video is annotated: for each frame it is noted whether a certain action occurs in it or not. Many frameworks exist for working with video data.
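The per-frame annotation described above can be converted into the segment form used for temporal localization. The helper below is an illustrative sketch (its name and the label values are invented): it groups consecutive frames with the same label into (start, end, label) segments.

```python
def frames_to_segments(labels):
    """Group per-frame action labels into (start, end, label) segments,
    with end inclusive, as used for temporal action localization."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current segment at a label change or at the end of the video.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start, i - 1, labels[start]))
            start = i
    return segments

# 0 = no action, 1 = action occurring in the frame.
segments = frames_to_segments([0, 0, 1, 1, 1, 0])
# -> [(0, 1, 0), (2, 4, 1), (5, 5, 0)]
```

The inverse direction (segments back to per-frame labels) is equally simple, which is why frame-level annotation is a convenient common denominator for both recognition and prediction tasks.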
The preprint determines key points by recognizing their 2D coordinates in video using the OpenPose framework.

Recurrent neural networks (RNN). The second most common artificial neural network architecture used to understand the type of activity in images is the recurrent neural network (RNN). RNNs use a directed-graph approach to process sequential input, such as time-series data. This makes them valuable for activity understanding, because frames (or feature vectors extracted from frames) can be supplied as input. The most common type of RNN is the LSTM (Long Short-Term Memory) network. The LSTM cell uses an input/output gate structure to perform long-term learning. The second most common type of RNN is the Gated Recurrent Unit (GRU). The GRU cell uses a reset/update gate structure, making training less computationally intensive than with LSTM cells [7].
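To make the reset/update gate structure of the GRU concrete, here is a minimal NumPy sketch of a single GRU step (weights are passed in explicitly and the shapes are illustrative; a real model would use a deep learning framework and learned parameters).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, params):
    """One GRU time step with reset gate r and update gate z."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate: how much to renew
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate: how much history to use
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde      # interpolate old and candidate state

# Toy dimensions: 3-dimensional input, 4-dimensional hidden state.
dim_x, dim_h = 3, 4
params = (np.zeros((dim_h, dim_x)), np.zeros((dim_h, dim_h)),
          np.zeros((dim_h, dim_x)), np.zeros((dim_h, dim_h)),
          np.zeros((dim_h, dim_x)), np.zeros((dim_h, dim_h)))
h_new = gru_cell(np.ones(dim_x), np.ones(dim_h), params)
```

Compared with the LSTM, the GRU has no separate cell state and one fewer gate, which is the source of its lower computational cost; per-frame pose vectors (such as OpenPose keypoints) would be fed in as the sequence of `x` inputs.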