HUMANOID ROBOT: INTRODUCTION AND THE VISION SYSTEM



Gestures are an important form of communication between people, and we regard expressions of the face as one of the most natural forms of human expression and communication. For people who are elderly, disabled, or simply inexperienced users of computer technology, a gesture interface would open the door to many applications, ranging from the control of machines to "helping hands". The crucial aspect of a gesture interface is not only real-time performance but also the ability to operate robustly in difficult real-world environments.
To understand human gestures based on head movement, a system must be capable of tracking facial features in real time. We consider real time to be the NTSC video frame rate (30 Hz). If facial tracking is done at lower rates, it becomes very difficult to understand gestures.
The real-time facial gesture recognition system consists of two modules running in parallel: a Face Tracker and a Gesture Recognizer. The face-tracking module fuses information from the vision system with information derived from a two-dimensional model of the face, using multiple Kalman filters.
We use dedicated hardware that tracks features in real time using template matching. Relying solely on such hardware, it is not possible to track a human face reliably and robustly: under normal lighting conditions the shape and shading of facial features change markedly when the head moves, and the vision hardware then fails to match the changing templates correctly. To solve this problem, Kalman filters are used to combine data from the tracking system with a geometrical model of the face. The resulting face tracker operates under natural lighting without artificial aids; it is robust and runs at video frame rate.
Reliable and rapid tracking of the face makes it possible to recognize gestures of the head. A gesture consists of a chain of atomic actions, where each atomic action represents a basic head motion, e.g. upwards or to the right. The "yes" gesture, for example, is represented by the atomic action chain "move up", "stop", "move down", and so on. If an observer reaches the end of a chain of atomic actions, a gesture is deemed to have been recognized. We use a probabilistic approach to decide whether an atomic action has been triggered. This is necessary because repetitions of the same action are rarely identical, e.g. nobody nods in exactly the same way every time.
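As a rough illustration of the fusion idea (not the system's actual filter design), the sketch below runs a constant-velocity Kalman filter, in Python, over the image coordinates of a single facial feature. The measured position would come from the template tracker; the filter's prediction can then be used to place the search window when the measurement is doubtful. The class name, state model and noise values are assumptions made purely for illustration.

    import numpy as np

    # Minimal constant-velocity Kalman filter for one tracked facial feature.
    # State: [x, y, vx, vy]; the measurement is the (x, y) position reported by
    # the template tracker. Noise magnitudes are illustrative guesses only.
    class FeatureKalman:
        def __init__(self, x, y, dt=1.0 / 30.0):
            self.state = np.array([x, y, 0.0, 0.0])
            self.P = np.eye(4) * 10.0                    # state covariance
            self.F = np.eye(4)                           # constant-velocity model
            self.F[0, 2] = self.F[1, 3] = dt
            self.H = np.zeros((2, 4))
            self.H[0, 0] = self.H[1, 1] = 1.0
            self.Q = np.eye(4) * 0.5                     # process noise (assumed)
            self.R = np.eye(2) * 4.0                     # measurement noise (assumed)

        def predict(self):
            self.state = self.F @ self.state
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.state[:2]                        # predicted (x, y)

        def update(self, zx, zy):
            innovation = np.array([zx, zy]) - self.H @ self.state
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)     # Kalman gain
            self.state = self.state + K @ innovation
            self.P = (np.eye(4) - K @ self.H) @ self.P
            return self.state[:2]                        # filtered (x, y)

The chain of atomic actions can be pictured as a small state machine that an observer steps through. The sketch below is a hypothetical observer for the "yes" gesture described above; the real system decides probabilistically whether an atomic action has been triggered, which is simplified here to fixed thresholds, and all labels and numeric values are illustrative assumptions.

    import math

    # Hypothetical observer for the "yes" gesture: a chain of atomic actions
    # that must be seen in order. Labels and thresholds are illustrative only.
    YES_CHAIN = ["move up", "stop", "move down"]

    def classify_action(vx, vy, speed_thresh=2.0):
        """Map a per-frame head-motion vector (pixels/frame) to an atomic action."""
        if math.hypot(vx, vy) < speed_thresh:
            return "stop"
        return "move up" if vy < 0 else "move down"      # image y grows downwards

    class GestureObserver:
        def __init__(self, chain):
            self.chain = chain
            self.index = 0                               # next expected atomic action

        def feed(self, vx, vy):
            """Feed one frame of head motion; return True when the chain completes."""
            if classify_action(vx, vy) == self.chain[self.index]:
                self.index += 1
                if self.index == len(self.chain):
                    self.index = 0
                    return True                          # gesture recognized
            return False

    observer = GestureObserver(YES_CHAIN)
    for vx, vy in [(0, -5), (0, 0), (0, 6)]:             # toy data: up, stop, down
        if observer.feed(vx, vy):
            print("recognized: yes")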

THE VISION SYSTEM:
The MEP tracking vision system is used to implement the facial gesture interface. This vision system is manufactured by Fujitsu and is designed to track multiple templates in frames of an NTSC video stream in real time. It consists of two VME-bus cards, a video module and a tracking module, and can track up to 100 templates simultaneously at video frame rate (30 Hz for NTSC).
The tracking of objects is based on template (8x8 or 16x16 pixels) comparison within a specified search area. The video module digitizes the video input stream and stores the digital images in dedicated video RAM, which the tracking module also accesses. The tracking module compares the digitized frame with the tracking templates within the bounds of the search windows. The comparison uses a cross correlation that sums the absolute differences between corresponding pixels of the template and the frame. The result of this calculation is called the distortion and measures the similarity of the two images: low distortions indicate a good match, while high distortions result when the two images are quite different.
To track a template of an object, the distortion must be calculated not at a single point in the image but at a number of points within the search window. To follow the movement of an object, the tracking module finds the position in the image frame where the template matches with the lowest distortion; the vector to this position of lowest distortion represents the motion. By moving the search window along the axis of the motion vector, objects can be easily tracked. The tracking module performs up to 256 cross correlations per template within a search window.
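A software analogue of what the tracking module does in hardware might look like the following sketch: compute the sum-of-absolute-differences distortion at each candidate position inside the search window, take the position with the lowest distortion as the new feature location, and derive the motion vector from it. A 16x16 window gives the 256 candidate positions mentioned above; the function names, array handling and window size are assumptions, and the real module performs this search in dedicated hardware.

    import numpy as np

    def distortion(template, frame, top, left):
        """Sum of absolute differences between the template and one frame patch."""
        h, w = template.shape
        patch = frame[top:top + h, left:left + w].astype(int)
        return int(np.abs(patch - template.astype(int)).sum())

    def track_step(template, frame, window_top, window_left, window_size=16):
        """Scan the search window and return the best match and its motion vector."""
        h, w = template.shape
        best, best_pos = None, (window_top, window_left)
        for dy in range(window_size):                    # 16 x 16 = 256 candidates
            for dx in range(window_size):
                top, left = window_top + dy, window_left + dx
                if top + h > frame.shape[0] or left + w > frame.shape[1]:
                    continue
                d = distortion(template, frame, top, left)
                if best is None or d < best:
                    best, best_pos = d, (top, left)
        # Motion vector from the search-window origin to the minimum-distortion
        # position; the window would then be moved along this vector.
        motion = (best_pos[0] - window_top, best_pos[1] - window_left)
        return best_pos, motion, best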
The MEP tracking vision system works well for objects that do not change their appearance or shading and are never occluded by other objects.
When the vision system is used to track a face in a head-and-shoulders image of a person, problems arise. Because the head occupies most of the image, a single template of the entire face would exceed the maximum template size allowed by the vision system. Therefore, it is only possible to track individual features of the face, such as the eyes or mouth. Facial features with high contrast are good candidates for tracking templates: for example, an eyebrow, which appears as a dark stripe on a light background (light skin), and the iris of the eye, which appears as a dark spot surrounded by the white of the eye, are well suited for tracking.
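One simple way to see why such high-contrast regions make good templates is to score candidate patches by their intensity variance: flat skin regions score low, while an eyebrow or iris against light skin scores high. The following sketch only illustrates that idea; it is not the selection method used in the system, and the patch size and stride are arbitrary assumptions.

    import numpy as np

    def contrast_score(patch):
        """Intensity variance: high for dark-on-light structure, low for flat skin."""
        return float(np.var(patch.astype(float)))

    def best_template(frame, size=16, stride=8):
        """Return the top-left corner of the highest-contrast size-by-size patch."""
        best, best_pos = -1.0, (0, 0)
        for top in range(0, frame.shape[0] - size, stride):
            for left in range(0, frame.shape[1] - size, stride):
                score = contrast_score(frame[top:top + size, left:left + size])
                if score > best:
                    best, best_pos = score, (top, left)
        return best_pos, best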
These problems are further complicated by the fact that otherwise well-suited tracking features can change their appearance dramatically when a person moves their head. The shading of the features can change due to uneven illumination, and the features appear to deform when the head is turned, moved up or down, or tilted to the side. All of these changes increase the distortion even when a template matches precisely at the correct position. They can also produce low distortions at the wrong coordinates, which then cause the search window to be moved away from the feature incorrectly. This problem arises, for example, when the head is turned far enough for one half of the face, with all its associated features, to disappear completely. Once a tracked feature has left its search window, the motion vectors calculated by the vision system are unpredictable. A method was developed that allows a search window to re-acquire its lost feature, yielding a reliable face tracker.
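The failure mode described above at least suggests a simple test for when a feature has been lost: even the best match inside the search window has a high distortion. The snippet below is only an assumed illustration of such a check, with a placeholder threshold; the actual method for re-acquiring a lost feature is described in the next part.

    # Placeholder threshold; a real value would be tuned to the template size and
    # to the distortion scale of the tracking hardware.
    LOST_THRESHOLD = 3000

    def feature_lost(best_distortion, threshold=LOST_THRESHOLD):
        """Flag a feature as lost when even the best match in its window is poor."""
        return best_distortion > threshold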


NEXT: TRACKING THE FACE, CONCLUSION