Real-Time ASL Letter Recognition Software

Motivation

American Sign Language (ASL) is a vital mode of communication for the Deaf and hard-of-hearing community. While many projects focus on word- or sentence-level translation, accurate recognition of the static ASL alphabet remains a valuable and educational step toward broader interaction. My goal was to build a lightweight, real-time system that recognizes individual ASL letters directly from video input, without relying on expensive hardware or cloud-based APIs, and runs standalone on any laptop.

This project explores the intersection of deep learning, computer vision, and accessibility technology by building a real-time system that recognizes static ASL letters from webcam input. The system is powered by a custom convolutional neural network (CNN) trained on the Sign Language MNIST dataset and integrated with real-time video processing and hand-tracking tools.

System Architecture

The pipeline consists of the following components:

  • Data Preprocessing and Augmentation:
    I began with the Sign Language MNIST dataset (28×28 grayscale images, 24 static classes: A–Y, excluding the motion-based letters J and Z). To improve generalization from static training data to dynamic webcam footage, I applied extensive data augmentation using torchvision transforms: random rotations, affine shifts, grayscale intensity jitter, and normalization (an augmentation sketch follows this list).

  • CNN Architecture:
    The model was built from scratch in PyTorch using three convolutional layers followed by fully connected layers. I incorporated batch normalization, ReLU activations, and dropout to prevent overfitting. After 10 epochs of training, the model achieved over 98% test accuracy on many classes (a model sketch follows this list).

  • Live Video and Inference Loop:
    Using OpenCV and MediaPipe, I captured live webcam input and performed real-time hand detection and tracking. The bounding box around the detected hand was used to crop the hand region, which was then preprocessed and passed to the model for prediction (a capture-loop sketch follows this list).

  • Inference Averaging:
    To reduce flicker and stabilize predictions, I implemented a 3-second rolling buffer of predicted labels and displayed the most common letter (a smoothing sketch follows this list). This smoothing step significantly improved usability and visual consistency.

  • Output and Interface:
    Bounding boxes and predicted letters are drawn directly on the webcam feed using OpenCV overlays. The model runs entirely on the CPU and is fast enough for real-time use.
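
The write-up does not list the exact augmentation parameters, so the rotation, shift, and jitter ranges below are illustrative assumptions; the sketch only shows the kind of torchvision pipeline described above.

```python
from torchvision import transforms

# Illustrative augmentation pipeline for the 28x28 grayscale training images.
# The specific ranges (10 degrees, 10% shifts, 0.2 jitter) are assumptions,
# not the exact values used in the project.
train_transform = transforms.Compose([
    transforms.ToPILImage(),                        # dataset rows arrive as arrays
    transforms.RandomRotation(degrees=10),          # small random rotations
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1)),  # random affine shifts
    transforms.ColorJitter(brightness=0.2,
                           contrast=0.2),           # intensity jitter on the grayscale image
    transforms.ToTensor(),                          # float tensor in [0, 1], shape [1, 28, 28]
    transforms.Normalize(mean=[0.5], std=[0.5]),    # roughly [-1, 1]
])
```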
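The exact layer widths are not given, so the channel counts, hidden size, and dropout rate below are placeholders; only the overall shape matches the description: three convolutional blocks with batch normalization and ReLU, followed by fully connected layers with dropout.

```python
import torch
import torch.nn as nn

class ASLLetterCNN(nn.Module):
    """Three conv blocks (Conv -> BatchNorm -> ReLU -> MaxPool) followed by
    fully connected layers with dropout. Channel widths, hidden size, and
    dropout rate are illustrative, not the project's exact values."""

    def __init__(self, num_classes: int = 24):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),   # 14x14 -> 7x7
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(), nn.MaxPool2d(2),  # 7x7 -> 3x3
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 3 * 3, 256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))
```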
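A condensed sketch of the capture-and-predict loop, reusing the ASLLetterCNN sketch above. The 20 px box padding, the weight filename asl_cnn.pth, and the preprocessing details are assumptions; only the overall flow (OpenCV capture, MediaPipe hand landmarks, crop, 28×28 grayscale resize, CNN prediction, overlay) follows the description.

```python
import cv2
import mediapipe as mp
import torch

# A-Y without J and Z; assumes training labels were remapped to 24 contiguous classes.
LETTERS = list("ABCDEFGHIKLMNOPQRSTUVWXY")

model = ASLLetterCNN(num_classes=len(LETTERS))
model.load_state_dict(torch.load("asl_cnn.pth", map_location="cpu"))  # placeholder filename
model.eval()

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    h, w, _ = frame.shape
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        xs = [int(p.x * w) for p in lm]
        ys = [int(p.y * h) for p in lm]
        # Pad the landmark bounding box by 20 px (an assumed margin).
        x1, y1 = max(min(xs) - 20, 0), max(min(ys) - 20, 0)
        x2, y2 = min(max(xs) + 20, w), min(max(ys) + 20, h)

        crop = frame[y1:y2, x1:x2]
        if crop.size:
            gray = cv2.resize(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY), (28, 28))
            x = torch.from_numpy(gray).float().div(255.0)
            x = ((x - 0.5) / 0.5).unsqueeze(0).unsqueeze(0)   # match training normalization
            with torch.no_grad():
                letter = LETTERS[model(x).argmax(dim=1).item()]

            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, letter, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)

    cv2.imshow("ASL letters", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```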
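The 3-second rolling buffer is described above; the deque-and-Counter bookkeeping below is one simple way to implement it, an illustrative choice rather than the project's exact code.

```python
import time
from collections import Counter, deque

class PredictionSmoother:
    """Keep the last few seconds of predicted letters and report the most
    common one. The 3-second window matches the write-up; the data structure
    is an illustrative choice."""

    def __init__(self, window_seconds: float = 3.0):
        self.window = window_seconds
        self.buffer = deque()          # (timestamp, letter) pairs

    def update(self, letter: str) -> str:
        now = time.time()
        self.buffer.append((now, letter))
        while self.buffer and now - self.buffer[0][0] > self.window:
            self.buffer.popleft()      # drop entries older than the window
        return Counter(l for _, l in self.buffer).most_common(1)[0][0]
```

Inside the capture loop, smoother.update(letter) would replace the raw prediction passed to cv2.putText, so the on-screen letter only changes when a new prediction dominates the window.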

Technologies Used

  • Python, PyTorch, torchvision

  • OpenCV for video streaming and annotation

  • MediaPipe for hand tracking

  • Google Colab for model training

  • Jupyter Notebook for development and analysis

Results

The final model performs robustly on most ASL letters in real-world conditions, including changing lighting and varied hand orientations. Some letters (e.g., Q and G) remain more ambiguous without temporal motion, but the static classifier works well as an assistive tool or educational demo. Model weights are stored as a .pth file and can be loaded on any machine for real-time testing.
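
For completeness, a minimal sketch of the save/load round trip; asl_cnn.pth is a placeholder filename and ASLLetterCNN refers to the architecture sketch above.

```python
import torch

# After training (e.g., in Colab): save only the weights.
torch.save(model.state_dict(), "asl_cnn.pth")                         # placeholder filename

# On any machine, CPU-only: rebuild the architecture and load the weights.
model = ASLLetterCNN(num_classes=24)
model.load_state_dict(torch.load("asl_cnn.pth", map_location="cpu"))
model.eval()                                                          # inference mode for the webcam loop
```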

Project Report
Github Repo