PoseResnet : A Top-down Machine Learning Model for Pose Estimation
This article introduces PoseResnet, a machine learning model that can be used with ailia SDK. You can easily use this model, along with many other ready-to-use ailia MODELS, to create AI applications with ailia SDK.
Overview
PoseResnet is a machine learning model developed by Microsoft Research as a baseline for single-person pose estimation. Once a person has been detected, for example with YOLOv3, PoseResnet can compute that person's skeleton.
Top-down vs. bottom-up
Machine learning models that detect the skeletons of multiple people use either a top-down or a bottom-up approach.
In the top-down approach, each person is first detected using YOLOv3 or a similar model, and the keypoints are then calculated for each detected person using a single-person skeleton detection model. This approach is highly accurate, but the processing load grows with the number of people in the image.
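The top-down flow can be sketched as a loop that runs the single-person model once per detected box, which is also why the cost grows with the number of people. This is only a minimal illustration: `detect_people` and `estimate_single_pose` are hypothetical callables standing in for a detector such as YOLOv3 and a pose model such as PoseResnet, and the image is assumed to be a row-major 2D structure.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]        # (x, y, width, height) in pixels
Keypoint = Tuple[float, float, float]  # (x, y, confidence)


def estimate_poses_top_down(
    image: Sequence[Sequence[object]],
    detect_people: Callable[[object], List[Box]],               # hypothetical detector
    estimate_single_pose: Callable[[object], List[Keypoint]],   # hypothetical pose model
) -> List[List[Keypoint]]:
    """Run a single-person pose model once per detected person."""
    poses = []
    for x, y, w, h in detect_people(image):
        # Crop the person region (rows first, then columns).
        crop = [row[x:x + w] for row in image[y:y + h]]
        keypoints = estimate_single_pose(crop)
        # Shift keypoints from crop coordinates back to image coordinates.
        poses.append([(kx + x, ky + y, conf) for kx, ky, conf in keypoints])
    return poses
```

With N detected people the pose model is called N times, so inference time scales roughly linearly with the number of people in the frame.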
The bottom-up approach recognizes multiple people at the same time by calculating all keypoints first and then grouping them using PAF (Part Affinity Fields) and other methods. OpenPose and LightWeightHumanPose are typical examples. They provide stable performance, but may connect the wrong keypoints in some cases.
Confidence and Part Affinity Fields
Whether you use the top-down or the bottom-up approach, the machine learning model outputs confidence heat maps for the keypoints. A heat map is designed to have large values at the location of its keypoint, so the keypoint location can be computed by finding where the value is largest.

Input image (standard image database)

Confidence
In the top-down approach, only one person is in the picture, so the keypoints can be calculated from the confidence alone.
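As a rough illustration of reading keypoints from confidence maps, the sketch below takes the position of the largest value in each heat map. It assumes the heat maps are a NumPy array of shape (num_keypoints, height, width); the function name is made up for this example and is not part of ailia SDK.

```python
import numpy as np


def keypoints_from_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """Return one (x, y, confidence) row per keypoint channel.

    heatmaps is assumed to have shape (num_keypoints, height, width).
    """
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    best = flat.argmax(axis=1)  # flat index of the peak in each channel
    ys, xs = np.unravel_index(best, (height, width))
    confidence = flat[np.arange(num_keypoints), best]
    return np.stack([xs, ys, confidence], axis=1).astype(np.float32)
```

The returned coordinates are in heat map pixels. To draw them on the original picture they still need to be scaled by the ratio between the image size and the heat map size.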
In contrast, in the bottom-up approach, multiple people are in the picture at the same time, and multiple sets of keypoints are detected simultaneously. Therefore, the keypoints need to be grouped and each group assigned to one person.
In the bottom-up approach, keypoints are grouped together based on Part Affinity Fields (PAF). A PAF contains information about the connections between keypoints. The keypoint assignment problem is solved by integrating the PAF values along the line between candidate keypoints and selecting the combination with the highest value.

Part Affinity Fields
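The integration of PAF values between two candidate keypoints can be approximated by sampling the field along the connecting segment. The following is a simplified sketch, not the OpenPose implementation: it assumes the PAF for one limb type is given as two 2D NumPy arrays (x and y components), uses nearest-pixel sampling, and leaves out the matching step that picks the best combination of pairs.

```python
import numpy as np


def paf_connection_score(paf_x, paf_y, start, end, num_samples=10):
    """Approximate the line integral of the PAF along the segment start -> end.

    paf_x, paf_y: 2D arrays (height, width) with the x and y field components
    for one limb type. start, end: (x, y) candidate keypoint positions.
    """
    start = np.asarray(start, dtype=np.float64)
    end = np.asarray(end, dtype=np.float64)
    direction = end - start
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    # Sample evenly spaced points on the segment and read the field there.
    ts = np.linspace(0.0, 1.0, num_samples)
    points = start + ts[:, None] * direction
    xs = np.clip(np.rint(points[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.rint(points[:, 1]).astype(int), 0, paf_x.shape[0] - 1)
    # Average the dot product between the field and the limb direction.
    return float(np.mean(paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]))
```

A pair of keypoints whose connecting segment is aligned with the field scores close to 1, a perpendicular pair scores close to 0, and a pair pointing against the field scores negative, so the highest-scoring combinations are the most plausible limbs.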
Architecture
Since PoseResnet takes the top-down approach, it computes the skeleton using only the confidence data. PoseResnet was developed to serve as a baseline and has a simple architecture that combines a ResNet backbone with deconvolution layers.

Source: https://arxiv.org/pdf/1804.06208.pdf
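To get a feel for how the backbone and the deconvolution layers fit together, the spatial sizes can be traced with a few lines of arithmetic. The numbers used here are assumptions for illustration: a 256x192 input (taken from the model file name), a ResNet backbone with an overall stride of 32, and three deconvolution layers with kernel 4, stride 2 and padding 1, which is how the linked paper is usually summarized.

```python
def deconv_output_size(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def heatmap_size(input_hw=(256, 192), backbone_stride=32, num_deconv=3):
    """Trace the spatial size from the input image to the output heat map."""
    # Assumed backbone stride of 32: 256x192 becomes an 8x6 feature map.
    h, w = (s // backbone_stride for s in input_hw)
    # Each assumed deconvolution layer doubles the height and width.
    for _ in range(num_deconv):
        h, w = deconv_output_size(h), deconv_output_size(w)
    return h, w


print(heatmap_size())  # (64, 48)
```

Under these assumptions the heat maps are one quarter of the input resolution in each dimension, so keypoint coordinates read from them must be multiplied by 4 to land on the input crop.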
PoseResnet performs better than traditional architectures such as Hourglass and CPN on the COCO dataset.

Source: https://arxiv.org/pdf/1804.06208.pdf
CMU-Pose in the table below refers to the popular OpenPose.

Source: https://arxiv.org/pdf/1804.06208.pdf
Keypoint definition
PoseResnet, like OpenPose, outputs 18 keypoints in COCO format.

Source: https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/media/keypoints_pose_18.png
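For reference, the 18 keypoints in the linked OpenPose figure can be written as a lookup table. The index order below follows that OpenPose figure. It is an assumption that a given PoseResnet wrapper returns keypoints in the same order, so check the sample code before relying on it. `label_keypoints` is a hypothetical helper for this example.

```python
# Index order as shown in the linked OpenPose 18-keypoint figure.
# Assumption: the order returned by a given PoseResnet wrapper may differ.
OPENPOSE_COCO18_NAMES = [
    "Nose", "Neck",
    "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist",
    "RHip", "RKnee", "RAnkle",
    "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar",
]


def label_keypoints(keypoints):
    """Pair each (x, y, confidence) tuple with its keypoint name."""
    if len(keypoints) != len(OPENPOSE_COCO18_NAMES):
        raise ValueError("expected 18 keypoints")
    return dict(zip(OPENPOSE_COCO18_NAMES, keypoints))
```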
Usage
With ailia SDK, the following sample code applies person detection using YOLOv3 Tiny and pose estimation using PoseResnet to a webcam video stream. The model converted for this sample is pose_resnet_50_256x192.pth.tar.
$ python3 pose_resnet.py -v 0
Below is an example of the result you can expect from PoseResnet.
Related topics
GAST : A machine learning model that predicts a 3D skeleton from a 2D skeleton
ailia Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.
ailia Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
ailia Tech BLOG