🗣️ Voice & NLP · Open Source · Proof of Concept

SALAMA

Scalable African LAnguage Multimodal AI Framework — an open-source Speech-to-Speech system built to empower African languages in voice-driven applications, starting with Swahili.

The Mission

Bridge the African linguistic digital divide by providing high-quality, culturally aligned speech and language intelligence systems. SALAMA enables seamless end-to-end voice interaction by converting speech → text → intelligent response → speech.

The SALAMA Pipeline

Three modular components working together for end-to-end voice interaction.

1. Speech-to-Text (STT)

Robust transcription for African languages using Whisper Small, fine-tuned for Swahili speech patterns and noisy environments.

2. Language Model (LLM)

Context-aware reasoning and response generation using UlizaLlama, instruction-tuned for natural Swahili text generation.

3. Text-to-Speech (TTS)

Natural, expressive voice synthesis using Facebook MMS (VITS-based), fine-tuned for Swahili prosody and tone.
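As a minimal sketch of how the three stages compose, each stage can be modeled as a plain callable. The class and type signatures below are illustrative assumptions, not SALAMA's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SalamaPipeline:
    """Illustrative composition of the three SALAMA stages.

    Each stage is a plain callable, so STT, LLM, and TTS can be
    swapped or fine-tuned independently of one another.
    """
    stt: Callable[[bytes], str]   # audio waveform -> Swahili transcript
    llm: Callable[[str], str]     # transcript -> generated response text
    tts: Callable[[str], bytes]   # response text -> synthesized audio

    def speech_to_speech(self, audio: bytes) -> bytes:
        # speech -> text -> intelligent response -> speech
        return self.tts(self.llm(self.stt(audio)))
```

In the real system each callable would wrap a model checkpoint (fine-tuned Whisper Small, UlizaLlama, MMS); here they are just function slots to show the composition.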

Key Features

Designed for extensibility, performance, and real-world impact.

🔧 Modular Architecture

STT, LLM, and TTS can be swapped, upgraded, or fine-tuned independently.

🌍 Low-Resource Languages

Models fine-tuned for African speech patterns and dialectal variations.

🖼️ Multimodal-Ready

Framework designed to support additional modalities such as vision.

⚡ Real-Time Pipeline

Optimized for low-latency voice interaction in conversational agents.

🧩 Extensible

Simple configuration for integrating new languages or tasks.
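Configuration-driven extensibility might look like the following sketch. The registry name, language codes, and model IDs are assumptions for illustration, not SALAMA's actual configuration format:

```python
# Hypothetical per-language model registry (structure and IDs are illustrative).
MODEL_REGISTRY = {
    "swh": {  # Swahili, the initial target language
        "stt": "openai/whisper-small",   # base checkpoint; fine-tuned in practice
        "tts": "facebook/mms-tts-swh",   # MMS (VITS-based) Swahili checkpoint
    },
}

def register_language(code: str, stt_model: str, tts_model: str) -> None:
    """Plug a new language into the pipeline by naming its checkpoints."""
    MODEL_REGISTRY[code] = {"stt": stt_model, "tts": tts_model}
```

Adding a new language then reduces to registering its fine-tuned STT and TTS checkpoints, with the LLM stage shared or swapped in the same way.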

Evaluation Results

Strong performance across all modules, validated on real-world Swahili data.

STT — Word Error Rate: 0.43 (62% improvement over baseline Whisper)
STT — Accuracy: 95.4% (Swahili validation set)
LLM — QA Accuracy: 95.5% (Swahili question-answering tasks)
LLM — BLEU: 0.49 (fluency and translation accuracy)
LLM — F1 Score: 0.90+ (weighted average across tasks)
TTS — MOS Score: 4.05 / 5.0 (human-rated naturalness)
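Word Error Rate is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal stdlib implementation of the metric (not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substituted word in a four-word reference yields a WER of 0.25.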

Interaction Modes

SALAMA supports six flexible modes for voice and text interaction.

🗣️→🗣️ Speech-to-Speech

Voice → LLM → Voice. Full end-to-end voice conversation.

✍️→✍️ Text-to-Text

Text → LLM → Text. Standard chat interaction.

🗣️→✍️ Speech → Text

Voice input, text response via LLM processing.

✍️→🗣️ Text → Speech

Text input, voice response with natural synthesis.

🎙️ Direct STT

Transcription only — no LLM processing.

🔊 Direct TTS

Synthesis only — no LLM processing.
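The six modes differ only in which stages run. A hypothetical dispatcher (mode names and signature are illustrative, not SALAMA's API) makes this explicit:

```python
def run_mode(mode, payload, stt, llm, tts):
    """Dispatch one of the six SALAMA interaction modes.

    stt, llm, tts are the stage callables; payload is audio bytes for
    speech input modes and a string for text input modes.
    """
    if mode == "speech-to-speech":
        return tts(llm(stt(payload)))   # full end-to-end voice conversation
    if mode == "text-to-text":
        return llm(payload)             # standard chat interaction
    if mode == "speech-to-text":
        return llm(stt(payload))        # voice in, text response
    if mode == "text-to-speech":
        return tts(llm(payload))        # text in, voice response
    if mode == "direct-stt":
        return stt(payload)             # transcription only, no LLM
    if mode == "direct-tts":
        return tts(payload)             # synthesis only, no LLM
    raise ValueError(f"unknown mode: {mode}")
```

Because the stages are passed in as callables, the same dispatcher covers every mode without duplicating pipeline logic.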

Models on HuggingFace

All SALAMA models are open-source and available on HuggingFace.

Tech stack: Python · PyTorch · HuggingFace Transformers · Ollama · ONNX · LoRA / QLoRA · VITS

Get Started

SALAMA is open-source under the MIT License. Clone the repository, install dependencies, and start building voice-powered African language applications.