Sign Language with Machine Learning
How we do it

Model Sign Language with Machine Learning
That has been our goal from the beginning. But how can we model Sign Language? It is a subtle mix of pose estimation, continuous movement, event detection, and natural language processing challenges. From the machine's point of view, the task is formidable.
The challenge
We have a series of video frames sampled according to the recorded FPS (frames per second). Examining a single frame alone would already be a difficult task, but a manageable one: we are used to typical image processing challenges such as lighting, position, or quality variations. Here, however, we have videos.
Hence, to recognize specific actions, we need to analyze a whole sequence of images instead. A series of frames forms a gloss – a single gesture, similar to a word for us.
Glossing is an annotation system that applies a label (a word) to each sign. The number of glosses depends on the language and the dataset. It is usually large, as it must cover as many words (glosses) as possible.
Finally, a sequence of glosses forms a sentence. But translating it one-to-one might still not be enough. Sign languages often have entirely different grammar than spoken languages. For instance, Polish Sign Language has completely different grammar than Polish.
Dziecko płakało w dzień i w nocy. (Child was crying all day and night.)
But gloss translation will result in:
Dziecko płakało cały noc dzień i noc. (Child cry whole night day and night)
where most words are not properly inflected and the correct tense is lost.
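To make the idea concrete, a gloss-level annotation can be thought of as a list of frame spans with labels. The sketch below is purely illustrative; the frame numbers and label names are made up, not taken from any dataset.
# Hypothetical annotation: each gloss labels a span of frames in the video.
annotation = [
    {"frames": (0, 30), "gloss": "DZIECKO"},   # child
    {"frames": (31, 75), "gloss": "PŁAKAĆ"},   # cry
    {"frames": (76, 110), "gloss": "CAŁY"},    # whole
    {"frames": (111, 150), "gloss": "NOC"},    # night
]
gloss_sequence = [item["gloss"] for item in annotation]  # ["DZIECKO", "PŁAKAĆ", "CAŁY", "NOC"]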
Approach it like a language
However, we already know how to deal with languages. This is nothing more than a combination of ASR (Automatic Speech Recognition) and a Natural Language Processing task.
Instead of a voice, we analyze videos. Instead of words, we have glosses. Easy, huh?
A common and successful approach to Sign Language Recognition combines a Transformer model with a feature extractor such as a Graph Convolutional Network (GCN) or a CNN.
Pipeline prototype
Inspired by the state of the art (SOTA), we prepared a proposal for our pipeline.
Everything starts with data
As mentioned earlier, our input is a video. Preprocessing the whole video is not exactly the fastest solution, so instead we use only selected video frames. The sequence of selected frames is then passed to the feature extractor.
Its job is to find the most relevant information in each frame and pass it to the transformer. The transformer analyzes the sequence of representations and extracts additional information from the context. The enhanced, context-aware representation is then passed to a classification head, which predicts which gloss it is.
Video --> Frames --> Feature extractor --> Transformer --> Classification heads --> Prediction for video
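For illustration, here is a minimal frame-sampling sketch, not our actual data loader: it picks a fixed number of evenly spaced frames from a clip using OpenCV. The function name and parameters are ours, chosen only for this example.
import cv2  # any video reader would do; OpenCV is used here for illustration

def sample_frames(video_path, num_frames=16):
    # Uniformly pick `num_frames` frame indices across the whole clip.
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [round(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]
    frames = []
    for index in indices:
        capture.set(cv2.CAP_PROP_POS_FRAMES, index)
        ok, frame = capture.read()
        if ok:
            frames.append(frame)
    capture.release()
    return frames  # a list of sampled frames (BGR images)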
Let’s look closer at each part of the model!
§ Feature extractors
A feature extractor is a model used to extract features directly from the input, e.g., a set of video frames. For instance, we use a CNN that extracts features from each video frame with multi_frame_feature_extractor. Another approach could be extracting the pose of the signer and passing the coordinates to a GCN (Graph Convolutional Network), for each frame separately.
The feature extractor returns a representation (feature vector) for every frame.
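As a rough sketch of the CNN variant (not the repository's MultiFrameFeatureExtractor), the same 2D backbone can be applied to every frame and the per-frame feature vectors stacked back into a sequence. The class name and the ResNet-18 backbone are assumptions made for this example.
import torch.nn as nn
import torchvision.models as models

class NaiveMultiFrameFeatureExtractor(nn.Module):
    # Apply the same 2D CNN to every frame of a clip.
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=None)  # random init (torchvision >= 0.13 API)
        backbone.fc = nn.Identity()  # drop the classifier, keep the 512-d features
        self.backbone = backbone

    def forward(self, frames):
        # frames: (batch, num_frames, 3, height, width)
        batch, num_frames = frames.shape[:2]
        flat = frames.flatten(0, 1)           # (batch * num_frames, 3, height, width)
        features = self.backbone(flat)        # (batch * num_frames, 512)
        return features.view(batch, num_frames, -1)  # (batch, num_frames, 512)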
§ Transformer
The transformer is a widely used deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in natural language processing (NLP). In our pipeline, the Transformer receives a representation of size (num_frames, representation_size) from the feature_extractor for each video.
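A minimal sketch of this step, using PyTorch's built-in encoder; the layer sizes below are placeholders, not our actual configuration.
import torch
import torch.nn as nn

representation_size = 512  # must match the feature extractor's output size
encoder_layer = nn.TransformerEncoderLayer(
    d_model=representation_size, nhead=8, batch_first=True
)
transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)

frame_representations = torch.randn(2, 16, representation_size)  # (batch, num_frames, representation_size)
context_aware = transformer(frame_representations)               # same shape, enriched with context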
§ Classification heads
The pipeline handles multi-head classification. We predefine classification_heads for both Gloss Translation and HamNoSys recognition. Our classification_heads are defined in our repository code.
hamnosys_heads = {
    "symmetry_operator": 9,
    "hand_shape_base_form": 12,
    "hand_shape_thumb_position": 4,
    "hand_shape_bending": 6,
    "hand_position_finger_direction": 18,
    "hand_position_palm_orientation": 8,
    "hand_location_x": 5,
    "hand_location_y": 37,
}  # number of classes for each head

gloss_head = {"gloss": 2400}  # number of classes for each head
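In practice, each entry becomes a separate linear head on top of the transformer output. A minimal sketch: the transformer_output_size value is a placeholder, and nn.ModuleDict is our choice for this example, not necessarily the repository's.
import torch.nn as nn

transformer_output_size = 512  # placeholder; depends on the chosen transformer
classification_heads = nn.ModuleDict({
    name: nn.Linear(transformer_output_size, num_classes)
    for name, num_classes in {**hamnosys_heads, **gloss_head}.items()
})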
The Hamburg Sign Language Notation System (HamNoSys) is an alphabetic gesture transcription system that describes signs with symbols for hand shape, hand location, and movement. Read more about HamNoSys here - Introduction to HamNoSys and here - Introduction to HamNoSys Part 2. Unlike glosses, HamNoSys always has the same number of possible classes.
The Sign Language Model
We use PyTorch Lightning to implement our models, as it gives excellent flexibility while relieving us from writing training and validation loops. It's as simple as that!
You can take a look at a full model here.
import pytorch_lightning as pl
import torch.nn as nn


class GlossTranslationModel(pl.LightningModule):
    """Awesome model for Gloss Translation"""

    def __init__(self, num_classes_dict, transformer_output_size=512):
        # num_classes_dict and transformer_output_size are passed in explicitly
        # here for simplicity, e.g. {**hamnosys_heads, **gloss_head}
        super().__init__()
        self.num_classes_dict = num_classes_dict
        # model parts
        self.model_loader = ModelLoader()
        self.feature_extractor = self.model_loader.load_feature_extractor()
        # use the same feature extractor for every frame
        self.multi_frame_feature_extractor = MultiFrameFeatureExtractor(
            self.feature_extractor
        )
        self.transformer = self.model_loader.load_transformer()
        # dynamically create as many heads as required
        # (nn.ModuleList so the heads are registered and moved to the right device)
        self.cls_head = nn.ModuleList(
            nn.Linear(transformer_output_size, num_classes)
            for num_classes in self.num_classes_dict.values()
        )

    def forward(self, input, **kwargs):
        predictions = []
        x = self.multi_frame_feature_extractor(input)
        x = self.transformer(x)
        for head in self.cls_head:
            predictions.append(head(x))
        return predictions
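A hypothetical usage sketch, assuming the repository's ModelLoader and MultiFrameFeatureExtractor are importable and the head dictionaries from above are in scope; the clip shape below is only an example and depends on what the loaded feature extractor expects.
import torch

model = GlossTranslationModel(num_classes_dict={**hamnosys_heads, **gloss_head})
dummy_clip = torch.randn(1, 16, 3, 224, 224)  # (batch, num_frames, channels, height, width)
with torch.no_grad():
    predictions = model(dummy_clip)
for head_name, logits in zip(model.num_classes_dict, predictions):
    print(head_name, logits.shape)  # one logits tensor per classification head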
Repository
Want to try it out? Check out the repository here: github.com/hearai/hearai. But be careful! It's still a work in progress.