Gestures are an important form of communication between people, and we regard expressions of the face as one of the most natural forms of human expression and communication. For people who are elderly, disabled or simply inexperienced users of computer technology, a gesture interface would open the door to many applications, ranging from the control of machines to “helping hands”. The crucial aspect of a gesture interface is not only real-time performance, but also the ability to operate robustly in difficult real-world environments.
To understand human gestures based on head movement, a system must be capable of tracking facial features in real time. We consider real time to be the NTSC video frame rate (30 Hz). If facial tracking is done at lower rates, it becomes very difficult to understand gestures.
The real-time facial gesture recognition system consists of two modules running in parallel: a Face Tracker and a Gesture Recognizer. The face-tracking module fuses information from the vision system with information derived from a two-dimensional model of the face using multiple Kalman filters.
We use dedicated hardware that tracks features in real time using template matching. Relying solely on such hardware, it is not possible to track a human face reliably and robustly, since under normal lighting conditions the shape and shading of facial features change markedly when the head moves. This causes the vision hardware to fail to match the changing templates correctly. To solve this problem, Kalman filters are used to combine data from the tracking system with a geometric model of the face. The result is a face tracker that operates under natural lighting without artificial aids. The system is robust and runs at video frame rate.
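The fusion can be pictured with a single feature as follows. This is only an illustrative sketch in Python/NumPy, assuming a constant-velocity motion model and made-up noise values; the actual tracker uses multiple coupled filters together with the two-dimensional face model. The filter predicts where the search window should be placed in the next frame and corrects that prediction with the position reported by the vision hardware.

    import numpy as np

    class FeatureFilter:
        """Constant-velocity Kalman filter for one tracked facial feature."""
        def __init__(self, dt=1.0 / 30.0):              # NTSC frame period
            self.F = np.array([[1, 0, dt, 0],
                               [0, 1, 0, dt],
                               [0, 0, 1,  0],
                               [0, 0, 0,  1]], float)   # state transition
            self.H = np.array([[1, 0, 0, 0],
                               [0, 1, 0, 0]], float)    # only (x, y) is measured
            self.Q = np.eye(4) * 0.5                    # process noise (assumed value)
            self.R = np.eye(2) * 4.0                    # measurement noise (assumed value)
            self.x = np.zeros(4)                        # state: [x, y, vx, vy]
            self.P = np.eye(4) * 100.0                  # initial uncertainty

        def predict(self):
            """Predicted feature position, used to place the search window."""
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:2]

        def update(self, z):
            """Correct the prediction with the (x, y) match from the hardware."""
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
            self.P = (np.eye(4) - K @ self.H) @ self.P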
Reliable and rapid tracking of the face gives rise to the ability to recognize gestures of the head. A gesture consists of a chain of atomic actions, where each atomic action represents a basic head motion, e.g. upwards or to the right. The “yes” gesture, for example, is represented by the atomic action chain “move up”, “stop”, “move down”, and so on. If an observer reaches the end of a chain of atomic actions, then a gesture is deemed to have been recognized. We use a probabilistic approach to decide whether an atomic action has been triggered. This is necessary since identical actions are rarely performed in exactly the same way; for example, nobody nods in the same way every time.
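The observer itself can be organised as a simple walk along the chain, as in the sketch below. The action labels and the example chain are placeholders chosen for illustration; in the real system the decision that an atomic action has occurred is made probabilistically from the tracked head motion, not from pre-labelled symbols.

    # Example chain for a "yes" nod; the labels and the chain are hypothetical.
    YES_GESTURE = ["up", "stop", "down", "stop", "up"]

    class GestureObserver:
        def __init__(self, name, chain):
            self.name = name
            self.chain = chain
            self.pos = 0                       # index of the next expected atomic action

        def feed(self, action):
            """Advance along the chain; return the gesture name when it completes."""
            if action == self.chain[self.pos]:
                self.pos += 1
                if self.pos == len(self.chain):
                    self.pos = 0
                    return self.name           # end of chain reached: gesture recognized
            elif self.pos == 0 or action != self.chain[self.pos - 1]:
                self.pos = 0                   # unexpected motion: restart the chain
            return None

    # Feeding the observer one atomic action per decision step:
    observer = GestureObserver("yes", YES_GESTURE)
    for action in ["up", "stop", "down", "stop", "up"]:
        recognized = observer.feed(action)     # "yes" after the final action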
THE VISION SYSTEM:
The facial gesture interface is implemented using the MEP tracking vision system. This vision system is manufactured by Fujitsu and is designed to track multiple templates in the frames of an NTSC video stream in real time. It consists of two VME-bus cards, a video module and a tracking module, and can track up to 100 templates simultaneously at video frame rate (30 Hz for NTSC).
The tracking of objects is based on template (8x8 or 16x16 pixel) comparison within a specified search area. The video module digitizes the video input stream and stores the digital images in dedicated video RAM. The tracking module also accesses this RAM and compares the digitized frame with the tracking templates within the bounds of the search windows. This comparison is done using a cross correlation that sums the absolute differences between corresponding pixels of the template and the frame. The result of this calculation is called the distortion and measures the similarity of the two images: low distortions indicate a good match, while high distortions result when the two images are quite different.
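In software, the distortion for a single template position can be approximated as below. This is only a sketch of the comparison the hardware performs; the tracking module works on 8x8 or 16x16 pixel templates held in its own video RAM.

    import numpy as np

    def distortion(template, patch):
        """Sum of absolute pixel differences between a template and an equally
        sized image patch; low values indicate a good match."""
        return int(np.abs(template.astype(np.int32) - patch.astype(np.int32)).sum())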
To track a template of an object it is necessary to calculate the distortion not just at one point in the image but at a number of points within the search window. To follow the movement of an object, the tracking module finds the position in the image frame where the template matches with the lowest distortion. The vector to the position of lowest distortion represents the motion. By moving the search window along the axis of this motion vector, objects can be easily tracked. The tracking module performs up to 256 cross correlations per template within a search window.
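A software sketch of this search, again only an approximation of what the tracking module does in hardware, slides the template across the search window, keeps the offset with the lowest distortion, and returns the corresponding motion vector. With a 16x16 grid of candidate offsets this amounts to the 256 correlations per template mentioned above.

    import numpy as np

    def best_match(frame, template, window_origin, window_size):
        """Find the lowest-distortion position of the template inside a search
        window of the frame; returns that position and the motion vector."""
        th, tw = template.shape
        ox, oy = window_origin                          # top-left corner of the search window
        ww, wh = window_size
        best_pos, best_d = None, None
        for dy in range(wh - th + 1):
            for dx in range(ww - tw + 1):
                patch = frame[oy + dy:oy + dy + th, ox + dx:ox + dx + tw]
                d = np.abs(template.astype(np.int32) - patch.astype(np.int32)).sum()
                if best_d is None or d < best_d:
                    best_d, best_pos = d, (ox + dx, oy + dy)
        motion = (best_pos[0] - ox, best_pos[1] - oy)   # vector used to move the window
        return best_pos, motion, best_d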
The MEP tracking vision system works perfectly for objects that do not change their appearance or shading and are never occluded by other objects.
When the vision system is used to track a face in a head-and-shoulders image of a person, problems arise. Because the head occupies most of the image, a single template of the entire face exceeds the maximum template size allowed by the vision system. Therefore, it is only possible to track individual features of the face, such as the eyes or mouth. Facial features with high contrast are good candidates for tracking templates. For example, an eyebrow, which appears as a dark stripe on a light background (light skin), and the iris of the eye, which appears as a dark spot surrounded by the white of the eye, are well suited to tracking.
These problems are further complicated by the fact that well-suited tracking features can change their appearance dramatically when a person moves their head. The shading of the features can change due to uneven illumination, and the features appear to deform when the head is turned, moved up or down, or tilted to the side. All of these changes increase the distortion even when a template is matched precisely at the correct position. They can also produce low distortions at the wrong coordinates, which then cause the search window to be incorrectly moved away from the feature. This problem is worst when the head is turned far enough for one half of the face, with all of its associated features, to disappear completely. Once a tracked feature has left its search window, the movement vectors calculated by the vision system are unpredictable. We have developed a method that allows a search window to correctly recover its lost feature, yielding a reliable face tracker.
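As a simple illustration of how the loss of a feature might be detected in the first place (an assumption for illustration, not the recovery method referred to above), a match can be treated as suspect when its distortion is high or when it disagrees strongly with the Kalman-filter prediction:

    def feature_lost(distortion_value, match_pos, predicted_pos,
                     max_distortion=3000, max_residual=12.0):
        """Flag a suspect match; both thresholds are assumed values."""
        dx = match_pos[0] - predicted_pos[0]
        dy = match_pos[1] - predicted_pos[1]
        residual = (dx * dx + dy * dy) ** 0.5
        return distortion_value > max_distortion or residual > max_residual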
NEXT: TRACKING THE FACE, CONCLUSION