Master's thesis presentation. Cyrine is advised by Dr. Felix Dietrich and Dr. Oscar Koller.
Cyrine Chaabani: Sign Language Detection for Microsoft Teams
SCCS Colloquium
Sign language plays a crucial role in facilitating effective communication for individuals with hearing impairments. As technology becomes increasingly integrated into our lives, it becomes imperative to create inclusive platforms that cater to the needs of sign language users, particularly in remote communication and collaboration settings.
This thesis addresses sign language detection within Microsoft Teams, a widely used communication and collaboration tool. By tackling this challenge, we aim to make Microsoft Teams more accessible and inclusive for individuals who rely on sign language as their primary mode of communication.
We begin by establishing our evaluation metrics. We use unweighted average recall (UAR) instead of accuracy, since UAR better captures performance on unbalanced datasets. We also evaluate our best-performing model qualitatively by visualising its classification output and analysing its attention activation weights, similarly to prior work. We then define the datasets used throughout our work: Signing in the Wild, the DGS Corpus, and the Teams dataset.
Our experiments start from a VGG16+RNN baseline for sign language detection, as defined and explored by Borg et al. The baseline combines the VGG16 convolutional neural network (CNN) for feature extraction with a recurrent neural network (RNN) that leverages temporal information in video segments. We train and evaluate it to establish a performance benchmark for the task.
To explore the potential of human skeleton-based approaches, we introduce the Hierarchical Co-occurrence Network (HCN) architecture as a baseline for skeleton-based sign language detection. The HCN model leverages the hierarchical composition of co-occurrence features extracted from human skeletons. We train and evaluate this baseline to assess how effectively it captures sign language.
Furthermore, we propose the InfoGCN architecture as an advanced model for sign language detection. InfoGCN combines attention-based graph convolutions with an information bottleneck framework and achieves state-of-the-art performance on action recognition benchmarks.
We optimize the performance of the InfoGCN model through several approaches: augmenting the human skeleton graph with landmarks, incorporating direct cross-modal connections (e.g., hands, face contours, eyebrows, and mouth), and integrating a graph convolution step into the encoding block of the InfoGCN architecture. We report the UAR for each of these experiments. Our final model achieves a detection UAR of 0.920 on the test split of Signing in the Wild and 0.825 on the test split of the DGS Corpus.
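To illustrate why unweighted average recall is preferred over accuracy on unbalanced data, the following is a minimal sketch, not the thesis implementation. It computes UAR as the plain mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The function names and the toy labels (1 for signing, 0 for not signing) are assumptions made purely for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally, whatever its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # number of samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: 8 non-signing frames (0) and 2 signing frames (1).
y_true = [0] * 8 + [1] * 2
# A degenerate detector that always predicts "not signing".
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)                   # 8 of 10 correct
uar = unweighted_average_recall(y_true, y_pred)  # recalls are 1.0 and 0.0
print(f"accuracy={acc:.2f}, UAR={uar:.2f}")
```

In this toy case the degenerate detector reaches an accuracy of 0.80 but a UAR of only 0.50, which is the chance level for two classes. This is the behaviour that makes UAR the more informative score when signing frames are rare.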