Madmon
Multilingual text embeddings specialized for Arabic dialects, Berber languages, and other underrepresented language families. Built on Google's EmbeddingGemma architecture with deep focus on the languages frontier models ignore.
Architecture
Built on EmbeddingGemma
Madmon extends Google's EmbeddingGemma — a 308-million-parameter model designed specifically for generating text embeddings. The base architecture produces 768-dimensional vectors with a 2,048-token context window.
We further trained this foundation on curated multilingual data, with a focus on:
- Arabic dialects — Darija, Egyptian, Gulf, Tunisian, Levantine, and more
- Berber languages — Tamazight, Kabyle, Tachelhit in Latin, Arabic, and Tifinagh scripts
- Other low-resource families — African, Southeast Asian, and indigenous languages
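Across every downstream task, these 768-dimensional vectors are compared with cosine similarity. A minimal sketch of that core operation, using random toy vectors as stand-ins for real Madmon outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 768-dimensional embeddings (not real model outputs).
rng = np.random.default_rng(0)
v = rng.standard_normal(768)
w = v + 0.1 * rng.standard_normal(768)  # a slightly perturbed copy

print(cosine_similarity(v, v))  # identical vectors score 1.0
print(cosine_similarity(v, w))  # near-duplicates score close to 1.0
```

In practice the vectors come from the model; everything downstream — search, classification, clustering — is built on this similarity.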
Semantic Search
Search across multilingual corpora. A Darija query retrieves relevant MSA, French, or English documents through a shared semantic space.
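Semantic search (and cross-lingual retrieval generally) is a ranking over cosine similarities. A sketch with made-up low-dimensional vectors standing in for Madmon embeddings — the document names and vector values are illustrative assumptions:

```python
import numpy as np

# Toy "embeddings" standing in for 768-d Madmon outputs.
docs = {
    "msa_article":   np.array([0.9, 0.1, 0.0]),
    "french_doc":    np.array([0.8, 0.2, 0.1]),
    "unrelated_doc": np.array([0.0, 0.1, 0.9]),
}

def search(query_vec: np.ndarray, corpus: dict, k: int = 2) -> list:
    """Return the k document ids most cosine-similar to the query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(corpus, key=lambda name: cos(query_vec, corpus[name]),
                    reverse=True)
    return ranked[:k]

query = np.array([1.0, 0.0, 0.0])  # stand-in for a Darija query's embedding
print(search(query, docs))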
Content Classification
Classify Arabic dialect content, social media posts, and user-generated text with high-quality representations.
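One simple way to classify with embeddings is nearest-centroid: average the embeddings of labeled examples per class, then assign new text to the closest centroid. The labels and vectors below are illustrative assumptions, not model outputs:

```python
import numpy as np

def nearest_centroid(x: np.ndarray, centroids: dict) -> str:
    """Assign x to the label whose centroid embedding is most similar (cosine)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda label: cos(x, centroids[label]))

# Toy class centroids, e.g. averaged embeddings of labeled dialect posts.
centroids = {
    "darija":   np.array([1.0, 0.0]),
    "egyptian": np.array([0.0, 1.0]),
}
post = np.array([0.9, 0.2])  # stand-in embedding of a new post
print(nearest_centroid(post, centroids))
```

A linear classifier trained on the embeddings works the same way at scale; the centroid version just needs no training loop.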
Cross-Lingual Retrieval
Match content across scripts and languages. Arabizi, Arabic-script, and Latin-script content map to the same embedding space.
Clustering & Analytics
Group social media content by dialect, topic, or sentiment. Discover patterns in multilingual datasets.
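Grouping by dialect or topic amounts to clustering the embedding vectors; a minimal k-means sketch over toy 2-d points standing in for Madmon embeddings (the data is fabricated to show two obvious groups):

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    """Minimal k-means over embedding vectors; returns a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.array([int(np.argmin(np.linalg.norm(centers - p, axis=1)))
                           for p in points])
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Two clearly separated toy groups, e.g. posts on two different topics.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = kmeans(points, k=2)
print(labels)
```

With real 768-dimensional embeddings the same loop (or a library implementation) surfaces dialect, topic, or sentiment groupings without labels.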
Resources
Try Madmon now
Generate embeddings for any text. 768-dimensional vectors optimized for multilingual content.
Open Playground