PoseResnet: A Top-down Machine Learning Model for Pose Estimation

This is an introduction to PoseResnet, a machine learning model that can be used with ailia SDK. You can easily use this model, along with many other ready-to-use ailia MODELS, to create AI applications with ailia SDK.

Overview

PoseResnet is a machine learning model developed by Microsoft Research as a baseline for single-person pose estimation. Once a person has been detected, for example with YOLOv3, PoseResnet can be used to compute that person's skeleton.

Paper: Simple Baselines for Human Pose Estimation and Tracking (arxiv.org)

Code: microsoft/human-pose-estimation.pytorch, the official PyTorch implementation of Simple Baselines for Human Pose Estimation and Tracking (github.com)

Top-down vs. bottom-up

Machine learning models that detect the skeletons of multiple people use either a top-down approach or a bottom-up approach.

In the top-down approach, each person is first detected using YOLOv3 or a similar model, and the keypoints are then calculated with a single-person skeleton detection model. This approach is highly accurate, but the processing load increases with the number of people.

The bottom-up approach recognizes multiple people at the same time by calculating all keypoints and then grouping them using PAF (Part Affinity Fields) or other methods. OpenPose and LightWeightHumanPose are typical examples. They provide stable performance, but they may connect the wrong keypoints in some cases.
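The control flow of a top-down pipeline can be sketched as follows. This is a minimal illustration, not the ailia SDK API: `detect_people`, `estimate_single_pose`, and the two stand-in functions are hypothetical names introduced here so the example runs without any model files.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]        # (x, y, width, height) in image pixels
Keypoint = Tuple[float, float, float]  # (x, y, confidence)


def estimate_poses_top_down(
    image,
    detect_people: Callable[[object], List[Box]],
    estimate_single_pose: Callable[[object, Box], List[Keypoint]],
) -> List[List[Keypoint]]:
    """Run the single-person pose model once per detected person."""
    poses = []
    for box in detect_people(image):
        # One pose-model call per person: this is why the cost of a
        # top-down pipeline grows with the number of people in the image.
        poses.append(estimate_single_pose(image, box))
    return poses


# Hypothetical stand-ins for a detector (e.g. YOLOv3) and a pose model.
def fake_detector(image) -> List[Box]:
    return [(10, 20, 50, 100), (80, 15, 40, 90)]


def fake_pose_model(image, box: Box) -> List[Keypoint]:
    x, y, w, h = box
    return [(x + w / 2, y + h / 2, 1.0)]  # a single keypoint at the box center


poses = estimate_poses_top_down(None, fake_detector, fake_pose_model)
print(len(poses))  # one skeleton per detected box
```

Passing the detector and the pose model in as callables keeps the loop independent of any particular model, so either stage can be swapped out.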

Confidence and Part Affinity Fields

Whether you use the top-down or the bottom-up approach, the machine learning model outputs a confidence heat map for each keypoint. The heat map is designed to have a large value at the location of the keypoint, so the position of each keypoint can be found by locating the maximum of its heat map.
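As a rough sketch of this step, the snippet below takes the position of the largest value in each heat map with NumPy. The array layout `(num_keypoints, height, width)` and the 64x48 map size are assumptions made for illustration; real implementations often add sub-pixel refinement, which is omitted here.

```python
import numpy as np


def keypoints_from_heatmaps(heatmaps: np.ndarray) -> np.ndarray:
    """Return one (x, y, confidence) row per keypoint channel.

    heatmaps has shape (num_keypoints, height, width).
    """
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    flat_index = flat.argmax(axis=1)                        # peak of each map
    ys, xs = np.unravel_index(flat_index, (height, width))  # back to 2D coordinates
    confidence = flat.max(axis=1)
    return np.stack([xs, ys, confidence], axis=1).astype(np.float32)


# Two synthetic heat maps, each with a single peak.
heatmaps = np.zeros((2, 64, 48), dtype=np.float32)
heatmaps[0, 10, 5] = 0.9   # keypoint 0 at x=5, y=10
heatmaps[1, 40, 30] = 0.7  # keypoint 1 at x=30, y=40
keypoints = keypoints_from_heatmaps(heatmaps)
print(keypoints)
```

The coordinates returned here are in heat map pixels; they still have to be scaled back to the original image.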

Input image (standard image database)

Confidence

In the top-down approach, each input image contains only one person, so the keypoints can be calculated from the confidence maps alone.

In contrast, in the bottom-up approach, multiple people are in the picture at the same time, and multiple sets of keypoints are detected simultaneously. Therefore, the keypoints need to be grouped and each group assigned to the correct person.

In the bottom-up approach, keypoints are grouped together based on Part Affinity Fields (PAF). A PAF contains information about the connections between keypoints. The keypoint assignment problem is solved by integrating the PAF values along the line between candidate keypoints and selecting the combination with the highest value.
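The integration step can be approximated by sampling the field along the segment between two candidate keypoints and averaging how well it aligns with the segment direction. The sketch below is a simplified illustration, not the OpenPose implementation: the function name `paf_score`, the two-array field layout, and the sample count are assumptions.

```python
import numpy as np


def paf_score(paf_x, paf_y, start, end, num_samples=10):
    """Approximate the line integral of a PAF along the segment start -> end.

    paf_x, paf_y: (height, width) arrays holding the field's x and y components.
    start, end: (x, y) positions of the two candidate keypoints.
    """
    start = np.asarray(start, dtype=np.float64)
    end = np.asarray(end, dtype=np.float64)
    direction = end - start
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length

    # Sample the field at evenly spaced points on the segment.
    ts = np.linspace(0.0, 1.0, num_samples)
    points = start[None, :] + ts[:, None] * direction[None, :]
    xs = np.rint(points[:, 0]).astype(int)
    ys = np.rint(points[:, 1]).astype(int)

    # Dot product between the field and the limb direction at each sample.
    alignment = paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]
    return float(alignment.mean())


# Synthetic field: a horizontal "limb" along row 5, pointing in +x.
paf_x = np.zeros((20, 20))
paf_y = np.zeros((20, 20))
paf_x[5, :] = 1.0

elbow = (2, 5)
wrist_candidates = [(12, 5), (12, 15)]
scores = [paf_score(paf_x, paf_y, elbow, wrist) for wrist in wrist_candidates]
best = int(np.argmax(scores))
print(scores, best)
```

The candidate lying on the limb scores highest because the field is aligned with the segment at every sample, while the diagonal candidate only overlaps the field at its starting point.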

Part Affinity Fields

Architecture

Since PoseResnet is a top-down approach, it computes the skeleton using only the confidence data. PoseResnet was developed to serve as a baseline and has a simple architecture that combines a ResNet backbone with deconvolution layers.
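To see how the deconvolution layers restore spatial resolution, the arithmetic below traces the feature map size through the network. It assumes the configuration described in the referenced paper (three transposed convolutions with kernel size 4 and stride 2) together with a padding of 1 and a standard ResNet stride of 32; treat these values as assumptions rather than a description of the converted model.

```python
def conv_transpose_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Output length of a transposed convolution along one axis (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def heatmap_size(height: int, width: int, backbone_stride: int = 32, num_deconv: int = 3):
    """Spatial size of the heat maps for a given input size."""
    # A standard ResNet reduces the spatial resolution by a factor of 32.
    h, w = height // backbone_stride, width // backbone_stride
    for _ in range(num_deconv):
        # With kernel 4, stride 2 and padding 1, each layer doubles the size.
        h, w = conv_transpose_out(h), conv_transpose_out(w)
    return h, w


print(heatmap_size(256, 192))  # (64, 48)
```

Under these assumptions, a 256x192 input yields 64x48 heat maps, a quarter of the input resolution along each axis.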

Source: https://arxiv.org/pdf/1804.06208.pdf

According to the paper, PoseResnet outperforms earlier architectures such as Hourglass and CPN on the COCO dataset.

Source: https://arxiv.org/pdf/1804.06208.pdf

CMU-Pose in the table below refers to the popular OpenPose.

Source: https://arxiv.org/pdf/1804.06208.pdf

Keypoint definition

PoseResnet, like OpenPose, outputs 18 keypoints in COCO format.

Source: https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/doc/media/keypoints_pose_18.png

Usage

With ailia SDK, you can apply person detection using YOLOv3 Tiny and pose estimation using PoseResnet to a web camera video stream with the following sample code. The model was converted from pose_resnet_50_256x192.pth.tar.

$ python3 pose_resnet.py -v 0

ailia-ai/ailia-models (github.com). Ailia input shape: (1, 3, 256, 192). Range: [-2.0, 2.0]. Automatically downloads the onnx and prototxt files on the first…
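The input shape (1, 3, 256, 192) means one image, three color channels, 256 pixels high and 192 pixels wide. The sketch below shows one way a person crop could be turned into such a tensor. The normalization with ImageNet mean and standard deviation and the RGB channel order are assumptions, chosen because they give values of roughly -2.1 to 2.6, which is in line with the approximate range quoted above; the sample script may preprocess differently.

```python
import numpy as np

# Assumption: ImageNet statistics, as used by many ResNet-based models.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_model_input(crop_rgb: np.ndarray) -> np.ndarray:
    """Convert a 256x192 RGB uint8 crop (H, W, 3) into a (1, 3, 256, 192) float tensor."""
    if crop_rgb.shape != (256, 192, 3):
        raise ValueError(f"expected a (256, 192, 3) crop, got {crop_rgb.shape}")
    scaled = crop_rgb.astype(np.float32) / 255.0
    normalized = (scaled - MEAN) / STD
    # Reorder from height-width-channel to channel-height-width, then add a batch axis.
    return normalized.transpose(2, 0, 1)[np.newaxis, ...]


# Random pixels stand in for a resized person crop.
crop = np.random.default_rng(0).integers(0, 256, size=(256, 192, 3), dtype=np.uint8)
tensor = to_model_input(crop)
print(tensor.shape, float(tensor.min()), float(tensor.max()))
```

Resizing the detected bounding box to 256x192 is left out; any image library can do that before this step.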

Below is an example of the result you can expect from PoseResnet.
