🗣️ Voice & NLP · Open Source · Proof of Concept

SALAMA

Scalable African LAnguage Multimodal AI Framework — an open-source Speech-to-Speech system built to empower African languages in voice-driven applications, starting with Swahili.

The Mission

Bridge the African linguistic digital divide by providing high-quality, culturally aligned speech and language intelligence systems. SALAMA enables seamless end-to-end voice interaction by converting speech → text → intelligent response → speech.

The SALAMA Pipeline

Three modular components working together for end-to-end voice interaction.

1. Speech-to-Text (STT)

Robust transcription for African languages using Whisper Small, fine-tuned for Swahili speech patterns and noisy environments.

2. Language Model (LLM)

Context-aware reasoning and response generation using UlizaLlama, instruction-tuned for natural Swahili text generation.

3. Text-to-Speech (TTS)

Natural, expressive voice synthesis using Facebook MMS (VITS-based), fine-tuned for Swahili prosody and tone.
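As a minimal sketch of how the three stages compose, each stage can be modeled as a plain callable. The class and type signatures below are illustrative assumptions, not SALAMA's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SalamaPipeline:
    """Illustrative composition of the three SALAMA stages.

    Each stage is a plain callable, so STT, LLM, and TTS can be
    swapped or fine-tuned independently of one another.
    """
    stt: Callable[[bytes], str]   # audio waveform -> Swahili transcript
    llm: Callable[[str], str]     # transcript -> generated response text
    tts: Callable[[str], bytes]   # response text -> synthesized audio

    def speech_to_speech(self, audio: bytes) -> bytes:
        # speech -> text -> intelligent response -> speech
        return self.tts(self.llm(self.stt(audio)))
```

In the real system each callable would wrap a model checkpoint (fine-tuned Whisper Small, UlizaLlama, MMS); here they are just function slots to show the composition.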

Key Features

Designed for extensibility, performance, and real-world impact.

🔧 Modular Architecture

STT, LLM, and TTS can be swapped, upgraded, or fine-tuned independently.

🌍 Low-Resource Languages

Models fine-tuned for African speech patterns and dialectal variations.

🖼️ Multimodal-Ready

Framework designed to support additional modalities such as vision.

⚡ Real-Time Pipeline

Optimized for low-latency voice interaction in conversational agents.

🧩 Extensible

Simple configuration for integrating new languages or tasks.
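Configuration-driven extensibility might look like the following sketch. The registry name, language codes, and model IDs are assumptions for illustration, not SALAMA's actual configuration format:

```python
# Hypothetical per-language model registry (structure and IDs are illustrative).
MODEL_REGISTRY = {
    "swh": {  # Swahili, the initial target language
        "stt": "openai/whisper-small",   # base checkpoint; fine-tuned in practice
        "tts": "facebook/mms-tts-swh",   # MMS (VITS-based) Swahili checkpoint
    },
}

def register_language(code: str, stt_model: str, tts_model: str) -> None:
    """Plug a new language into the pipeline by naming its checkpoints."""
    MODEL_REGISTRY[code] = {"stt": stt_model, "tts": tts_model}
```

Adding a new language then reduces to registering its fine-tuned STT and TTS checkpoints, with the LLM stage shared or swapped in the same way.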

Evaluation Results

Strong performance across all modules, validated on real-world Swahili data.

STT — Word Error Rate: 0.43 (62% improvement over baseline Whisper)
STT — Accuracy: 95.4% (Swahili validation set)
LLM — QA Accuracy: 95.5% (Swahili question-answering tasks)
LLM — BLEU: 0.49 (fluency and translation accuracy)
LLM — F1 Score: 0.90+ (weighted average across tasks)
TTS — MOS Score: 4.05 / 5.0 (human-rated naturalness)
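Word Error Rate is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal stdlib implementation of the metric (not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a four-word reference yields a WER of 0.25.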

Interaction Modes

SALAMA supports six flexible modes for voice and text interaction.

🗣️→🗣️ Speech-to-Speech

Voice → LLM → Voice. Full end-to-end voice conversation.

✍️→✍️ Text-to-Text

Text → LLM → Text. Standard chat interaction.

🗣️→✍️ Speech → Text

Voice input, text response via LLM processing.

✍️→🗣️ Text → Speech

Text input, voice response with natural synthesis.

🎙️ Direct STT

Transcription only — no LLM processing.

🔊 Direct TTS

Synthesis only — no LLM processing.
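The six modes differ only in which stages run. A hypothetical dispatcher (mode names and signature are illustrative, not SALAMA's API) makes this explicit:

```python
def run_mode(mode, payload, stt, llm, tts):
    """Dispatch one of the six SALAMA interaction modes.

    stt, llm, tts are the stage callables; payload is audio bytes for
    speech input modes and a string for text input modes.
    """
    if mode == "speech-to-speech":
        return tts(llm(stt(payload)))   # full end-to-end voice conversation
    if mode == "text-to-text":
        return llm(payload)             # standard chat interaction
    if mode == "speech-to-text":
        return llm(stt(payload))        # voice in, text response
    if mode == "text-to-speech":
        return tts(llm(payload))        # text in, voice response
    if mode == "direct-stt":
        return stt(payload)             # transcription only, no LLM
    if mode == "direct-tts":
        return tts(payload)             # synthesis only, no LLM
    raise ValueError(f"unknown mode: {mode}")
```

Because the stages are passed in as callables, the same dispatcher covers every mode without duplicating pipeline logic.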

Models on HuggingFace

All SALAMA models are open-source and available on HuggingFace.

Tech stack: Python · PyTorch · HuggingFace Transformers · Ollama · ONNX · LoRA / QLoRA · VITS

Get Started

SALAMA is open-source under the MIT License. Clone the repository, install dependencies, and start building voice-powered African language applications.