This paper addresses the problem of human-action recognition by introducing a sparse representation of image sequences as a collection of spatiotemporal events that are localized at points that are salient both in space and time. The spatiotemporal salient points are detected by measuring the variations in the information content of pixel neighborhoods not only in space but also in time. An appropriate distance metric between two collections of spatiotemporal salient points is introduced, which is based on the chamfer distance and an iterative linear time-warping technique that deals with time expansion or time-compression issues. A classification scheme that is based on relevance vector machines and on the proposed distance measure is proposed. Results on real image sequences from a small database depicting people performing 19 aerobic exercises are presented.
pubs.doc.ic.ac.uk: built & maintained by Ashok Argent-Katwala.