Speaker
Description
Sensitivity to subtle facial changes demands a dedicated attention model that captures the geometric layout of faces in video sequences, enabling self-attention-based architectures to understand the structural and dynamic aspects of facial data for engagement monitoring; convolutional networks lack this capability. We present a convolution-free approach to video-based facial expression recognition built exclusively on the TimeSformer architecture. Our method, named “FMeshformer”, adapts the standard TimeSformer architecture, originally designed for generic action recognition, by applying a mesh positional encoder after the patch embedding. The proposed model overcomes the limitation of linear positional embeddings, which fail to capture the nuanced spatial relationships between facial features, by focusing on the mesh-aware geometric layout of faces in facial-expression videos to understand the dynamic and structural aspects of engagement from facial data. With this design, FMeshformer achieves state-of-the-art results on facial expression detection on the benchmark DAiSEE dataset, with a test accuracy of 71%. Finally, compared to three-dimensional convolutional networks, our model is faster to train on a new dataset, achieves significantly higher test efficiency, and can be applied to longer video clips (over 60 seconds).
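No reference implementation accompanies this description; purely as an illustrative sketch, the following PyTorch module shows one way a mesh positional encoder could be inserted after the patch embedding, replacing a purely linear positional table with codes derived from per-patch facial-mesh coordinates. The class and attribute names (MeshPositionalEncoder, mesh_proj) and the assumption that each patch is assigned a normalized 2-D mesh coordinate are hypothetical, not taken from the talk.

import torch
import torch.nn as nn

class MeshPositionalEncoder(nn.Module):
    """Hypothetical mesh-aware positional encoder: adds codes derived
    from 2-D facial-mesh coordinates to the patch tokens, in place of
    a purely linear positional embedding table."""

    def __init__(self, embed_dim: int, num_patches: int, mesh_dim: int = 2):
        super().__init__()
        # Learnable fallback term, analogous to the standard positional table.
        self.pos_table = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Projects per-patch mesh coordinates (e.g. the mean landmark
        # position falling inside each patch) into the embedding space.
        self.mesh_proj = nn.Sequential(
            nn.Linear(mesh_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patch_tokens, mesh_coords):
        # patch_tokens: (B, N, D) output of the patch-embedding layer
        # mesh_coords:  (B, N, mesh_dim) normalized mesh coordinates per patch
        return patch_tokens + self.pos_table + self.mesh_proj(mesh_coords)

# Example: encode 196 patch tokens of width 768 with random mesh coordinates.
tokens = torch.randn(2, 196, 768)
coords = torch.rand(2, 196, 2)
encoder = MeshPositionalEncoder(embed_dim=768, num_patches=196)
out = encoder(tokens, coords)  # (2, 196, 768), ready for space-time attention

Under these assumptions, the mesh term carries the face's geometric structure while the learnable table covers patches with no meaningful landmark content; the sum is then fed to the TimeSformer's divided space-time attention blocks unchanged.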