Sign2Text is a real-time sign language recognition system that captures webcam input and displays corresponding words as subtitles.
It is designed to assist communication by recognizing hand gestures using deep learning and computer vision.
- Auto prediction without button press
- MediaPipe-based hand keypoint extraction
- Conv1D + BiLSTM sequence classification
- Top-3 predictions with confidence scores
- Temperature scaling for stable outputs
- Real-time PyQt5 GUI (no keyboard interaction)
- Data augmentation & evaluation support
OS & Tools:
- Windows 10 / macOS (Apple Silicon)
- Anaconda (Python 3.8.20)
- Visual Studio Code
Core Libraries:
```
numpy==1.24.3
pandas==2.0.3
opencv-python==4.11.0.86
mediapipe==0.10.11
scipy==1.10.1
tqdm==4.64.1
pillow==10.4.0
tensorflow==2.13.0
keras==2.13.1
PyQt5==5.15.9
```
| Role | Name (GitHub) | Responsibility | Details |
|---|---|---|---|
| Team Lead | An Jihee | Modeling, Real-time Inference System | Built the Conv1D + BiLSTM model and the PyQt5-based GUI for real-time sign recognition; conducted sequence-length testing |
| Member | Kim Minseo | Data Collection, Preprocessing, Evaluation | Extracted raw keypoints, constructed labeled CSVs, and participated in testing |
| Member | Lee Jimin | Data Augmentation, Evaluation | Generated augmented sequences and conducted hold-out testing |
```
Sign2Text/
├── dataset/
│   ├── npy/
│   └── augmented_samples/
├── models/
│   ├── L10/, L20/, ...
│   ├── sign_language_model_normalized.h5
│   ├── label_classes.npy
│   ├── X_mean.npy
│   └── X_std.npy
├── src/
│   ├── dataset_preprocessing/
│   │   ├── add_angles_to_merged.py
│   │   ├── batch_generate_csv.py
│   │   ├── create_total_seq.py
│   │   ├── merge_csv.py
│   │   └── zip_csv.py
│   ├── hold_out_test/
│   │   ├── holdout_test.py            # Run predictions on unseen samples
│   │   ├── auto_infer.py
│   │   ├── make_test_labels.py
│   │   ├── holdout_results.csv
│   │   └── test_labels.csv
│   ├── predict/
│   │   ├── predict_test_sample.py
│   │   ├── predict_test_sample_normalized.py
│   │   └── label_similarity_filter.py
│   ├── train/
│   │   ├── train_by_seq.py
│   │   └── train_by_seq_aug.py
│   ├── viz/
│   │   ├── merge_aug_origin_npy.py
│   │   ├── viz_confusion_top3.py
│   │   └── viz_history.py
│   └── webcam/
│       ├── webcam_test.py             # For data collection
│       ├── realtime_infer_test.py     # Lightweight prediction only
│       └── sign2text_gui.py           # PyQt5-based GUI app (Main App)
├── requirements.txt
└── README.md
```
- Run: `python src/webcam/sign2text_gui.py`
- Webcam preview + Korean font rendering
- Left panel: video feed
- Right panel:
  - Status (`대기 중` / `수집 중`, i.e. waiting / collecting)
  - Top-3 predictions (with confidence)
  - Result display (`신뢰도 부족`, "low confidence", if below threshold)
- Buttons:
  - `수집 시작` (start collection): toggles sample collection
Temperature scaling and confidence thresholding are included, and the sequence is auto-cleared after each prediction.
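A minimal sketch of how temperature scaling and the confidence threshold can interact at prediction time; the constant names and values (`TEMPERATURE`, `CONF_THRESHOLD`) are illustrative assumptions, not the exact settings used in `sign2text_gui.py`:

```python
import numpy as np

TEMPERATURE = 1.5      # >1 softens the softmax distribution (illustrative value)
CONF_THRESHOLD = 0.7   # minimum confidence required to display a result (illustrative value)

def predict_top3(model, sequence, label_classes):
    """Predict one (1, seq_len, 114) sequence; return the display label and top-3 list."""
    probs = model.predict(sequence, verbose=0)[0]    # softmax output of the trained model
    # Re-scale the distribution with a temperature (equivalent to dividing the logits by T)
    scaled = np.exp(np.log(probs + 1e-9) / TEMPERATURE)
    scaled /= scaled.sum()

    top3_idx = np.argsort(scaled)[::-1][:3]
    top3 = [(label_classes[i], float(scaled[i])) for i in top3_idx]

    label, conf = top3[0]
    if conf < CONF_THRESHOLD:
        label = "신뢰도 부족"                         # "low confidence" message shown in the GUI
    return label, top3
```

After a result is returned, the GUI clears its sequence buffer and begins collecting frames for the next prediction.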
The following GIF demonstrates a successful real-time recognition of the sign language gesture for "treatment" (치료) using our desktop application built with PyQt5 and MediaPipe.
- Input Shape: `(sequence_length, 114)`
  - 84 keypoint features (21 points × 2 hands × x/y coordinates)
  - 30 joint angles
- Layers:
  - Conv1D → BatchNorm → Dropout
  - Conv1D → BatchNorm → Dropout
  - BiLSTM → Dropout
  - Dense → Dropout → Softmax
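A hedged Keras sketch of this layer stack; the filter counts, kernel sizes, LSTM units, and dropout rates below are illustrative assumptions rather than the exact training configuration:

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 61     # total labels in the dataset
SEQ_LEN = 20         # e.g. the "L20" window size
NUM_FEATURES = 114   # 84 keypoint coordinates + 30 joint angles

def build_model(seq_len=SEQ_LEN, num_features=NUM_FEATURES, num_classes=NUM_CLASSES):
    model = models.Sequential([
        layers.Input(shape=(seq_len, num_features)),
        # Two Conv1D blocks extract local temporal patterns from the keypoint stream
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        # The BiLSTM models the gesture as a whole sequence
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.4),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```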
- `SEQ_NAME` defines the window size, e.g. `SEQ_NAME = "L20"`
- Supported values: `"L10"`, `"L20"`, etc.
- Make sure the model and `X_mean.npy`, `X_std.npy` in `models/L##` match this name
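As an example, loading the artifacts that belong to one `SEQ_NAME` might look like the sketch below (file names follow the directory layout above; treat the exact paths as assumptions):

```python
import numpy as np
from tensorflow.keras.models import load_model

SEQ_NAME = "L20"                       # window size: "L10", "L20", ...
MODEL_DIR = f"models/{SEQ_NAME}"

# The model and its normalization statistics must come from the same SEQ_NAME directory
model = load_model(f"{MODEL_DIR}/sign_language_model_normalized.h5")
label_classes = np.load(f"{MODEL_DIR}/label_classes.npy", allow_pickle=True)
X_mean = np.load(f"{MODEL_DIR}/X_mean.npy")
X_std = np.load(f"{MODEL_DIR}/X_std.npy")

def normalize(sequence):
    """Apply the training-time normalization to a (seq_len, 114) sequence."""
    return (sequence - X_mean) / (X_std + 1e-8)
```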
The Sign2Text model was trained on a curated dataset of 61 sign language labels, constructed from the original AI Hub data together with newly augmented samples. The distribution of labels is as follows:
| Category | Label Count | Description |
|---|---|---|
| Original only | 31 labels | Labels that exist only in the original dataset (`.npy` without augmentation) |
| Common (Shared) | 19 labels | Labels included in both the original and augmented data |
| Augmented only | 11 labels | Newly added labels from webcam-based augmentation |
Original only (31 labels):
๋ฐฅ์ฅ, ์ถ๊ทผ, ํด์ฌ, ํฌ์ผ, ์ฌ์, ํ์, ์ฌํ๊ต, ๋ฐฑ์, ์ฑํ, ์ ํ, ๋ด์ง๋๋, ๋จ์, ๋์์ค, ์ ํ, ๊ตญ์ดํ, ๋ค๊ณผ, ์ํ, ์์คํค, ์ธ์ฐ, ๊ตฌ์ง, ํ๊ต์ฐํ, ๋ฌธํ, ์์ต, ์ฌ์ง, ์น์๋ค, ๋ฒ๊ฟ, ๋ฐฐ๋๋ฏผํด, ๋ฒ์ค๊ฐ, ์๋น, ์์ธ

Common (19 labels):
๊ฐ๊ธฐ, ๊ฐํ, ๊ฒฝ์ฐฐ์, ๋์, ๋์ผ์ด, ๋ผ๋ฉด, ๋ณ๋ฌธ์, ๋ณด๊ฑด์, ์๋ฉด์ , ์ , ์ฌํ๋ค, ์ซ์ดํ๋ค, ์ปคํผ, ์ฝ๋ผ, ํด์, ์น๋ฃ, ํ๊ต, ์์, ์์ธ

Augmented only (11 labels):
๊ฟ, ๋(1์ธ์นญ), ๋(2์ธ์นญ), ๋ธ, ์๋ค, ์๋ํ์ธ์, ์์ด, ์ด๋, ์์ฌ, ์ข๋ค
A Venn diagram showing the overlap between original and augmented label sets.
- Only 30 augmented labels (common + augmented-only) were recognized reliably in real-time testing.
- Original-only labels were not recognized, even though they were included during training.
- This suggests that data recency and augmentation quality have a stronger impact on real-time performance than mere presence of a label in the training set.
- Labels with freshly collected webcam samples showed significantly higher prediction confidence.
- Run: `python src/webcam/webcam_test.py`
- Press `s` → show gesture
- Press `w` → save `raw_seq_*.npy` and `norm_seq_*.npy`
- Data saved at: `dataset/augmented_samples/<label>/`
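Each saved sequence is a stack of per-frame feature vectors: 84 keypoint coordinates plus 30 joint angles, matching the model input described above. A simplified sketch of how one such 114-dimensional frame vector could be built from MediaPipe Hands output (the angle scheme below is an assumption; the project's actual definition in `add_angles_to_merged.py` may differ):

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def frame_to_features(frame_bgr, hands):
    """Return a 114-dim vector: 84 keypoint coords (21 points x 2 hands x (x, y)) + 30 angles."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    coords = np.zeros((2, 21, 2), dtype=np.float32)      # two hand slots, 21 landmarks each
    if results.multi_hand_landmarks:
        for h, hand in enumerate(results.multi_hand_landmarks[:2]):
            coords[h] = [(lm.x, lm.y) for lm in hand.landmark]

    angles = []
    for hand in coords:                                    # 15 joint angles per hand (assumed scheme)
        for a, b, c in [(0, 1, 2), (1, 2, 3), (2, 3, 4),
                        (0, 5, 6), (5, 6, 7), (6, 7, 8),
                        (0, 9, 10), (9, 10, 11), (10, 11, 12),
                        (0, 13, 14), (13, 14, 15), (14, 15, 16),
                        (0, 17, 18), (17, 18, 19), (18, 19, 20)]:
            v1, v2 = hand[a] - hand[b], hand[c] - hand[b]
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
            angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))

    return np.concatenate([coords.flatten(), np.array(angles, dtype=np.float32)])

# Usage: hands = mp_hands.Hands(max_num_hands=2); features = frame_to_features(frame, hands)
```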
- Run: `python src/train/train_by_seq_aug.py`
- Merges raw + augmented samples
- Saves the model and normalization stats to `models/L##/`
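A sketch of the normalization bookkeeping implied by this step: after the original and augmented sequences are merged, per-feature statistics are computed and stored next to the model so that inference can normalize identically. The paths and the merging itself are simplified assumptions (for example, all sequences are assumed to share the same length):

```python
import glob
import os
import numpy as np

SEQ_NAME = "L20"
OUT_DIR = os.path.join("models", SEQ_NAME)
os.makedirs(OUT_DIR, exist_ok=True)

# Gather original and augmented samples into one training array (simplified)
paths = sorted(glob.glob("dataset/npy/**/*.npy", recursive=True)
               + glob.glob("dataset/augmented_samples/**/*.npy", recursive=True))
X = np.stack([np.load(p) for p in paths])        # shape: (num_samples, seq_len, 114)

# Per-feature mean/std are saved so real-time inference can reuse them
X_mean = X.mean(axis=(0, 1))
X_std = X.std(axis=(0, 1))
np.save(os.path.join(OUT_DIR, "X_mean.npy"), X_mean)
np.save(os.path.join(OUT_DIR, "X_std.npy"), X_std)

X_norm = (X - X_mean) / (X_std + 1e-8)           # normalized training input
```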
To train without augmentation:
- Run: `python src/train/train_by_seq.py`

Hold-out testing:
- Run: `python src/hold_out_test/holdout_test.py`
- Loads samples from `videos/` and uses `test_labels.csv`
- Outputs to `holdout_results.csv`
- Visualize results with: `python src/viz/viz_confusion_top3.py`

Label similarity analysis:
- Run: `python src/predict/label_similarity_filter.py`
- Computes cosine similarity between mean label vectors
- Useful for identifying confusing signs
- Input: merged dataset with angles
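A minimal sketch of the underlying idea, assuming one mean feature vector has already been computed per label (the real script's input format and thresholds may differ):

```python
from typing import Dict, List, Tuple
import numpy as np

def most_similar_pairs(mean_vectors: Dict[str, np.ndarray],
                       top_k: int = 5) -> List[Tuple[str, str, float]]:
    """Rank label pairs by cosine similarity of their mean feature vectors."""
    labels = list(mean_vectors)
    pairs = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            a, b = mean_vectors[labels[i]], mean_vectors[labels[j]]
            cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
            pairs.append((labels[i], labels[j], cos))
    # Pairs with high similarity point at signs the model is likely to confuse
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:top_k]
```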
- `viz_history.py`: plot training history
- `viz_confusion_top3.py`: visualize confusion matrix
- `merge_aug_origin_npy.py`: merge original/augmented samples for comparison
- On macOS, use `cv2.VideoCapture(1)` if `0` doesn't work
- Use a Korean font (`AppleGothic` or `malgun.ttf`) for readable text
- Recommended: collect 30+ samples per label for robust accuracy
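OpenCV's `cv2.putText` cannot draw Hangul glyphs, which is why a Korean TrueType font matters; one common approach is to render the text through Pillow. A small sketch covering both tips above, with example font paths that vary by system:

```python
import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

# Camera-index fallback: try 0 first, then 1 (often needed on macOS)
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    cap = cv2.VideoCapture(1)

# Any font containing Hangul glyphs works; these paths are examples only
FONT_PATH = "/System/Library/Fonts/Supplemental/AppleGothic.ttf"  # Windows: "C:/Windows/Fonts/malgun.ttf"
font = ImageFont.truetype(FONT_PATH, 32)

def draw_korean_text(frame_bgr, text, pos=(10, 10)):
    """Overlay Korean text on a BGR frame via Pillow and return the result as BGR."""
    img = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    ImageDraw.Draw(img).text(pos, text, font=font, fill=(255, 255, 255))
    return cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
```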
This project is part of the Open Source Programming course at Sookmyung Women's University.
It uses MediaPipe and TensorFlow under the Apache 2.0 License.
