Madmon
Multilingual text embeddings specialized for Arabic dialects, Berber languages, and other underrepresented language families. Built on Google's EmbeddingGemma architecture with deep focus on the languages frontier models ignore.
Architecture
Built on EmbeddingGemma
Madmon extends Google's EmbeddingGemma — a 308-million-parameter model designed specifically for generating text embeddings. The base architecture produces 768-dimensional vectors with a 2,048-token context window.
We further trained this foundation on curated multilingual data, with a focus on:
- Arabic dialects — Darija, Egyptian, Gulf, Tunisian, Levantine, and more
- Berber languages — Tamazight, Kabyle, Tachelhit in Latin, Arabic, and Tifinagh scripts
- Other low-resource families — African, Southeast Asian, and indigenous languages
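Across every downstream task, these 768-dimensional vectors are compared with cosine similarity. A minimal sketch of that core operation, using random toy vectors as stand-ins for real Madmon outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 768-dimensional embeddings (not real model outputs).
rng = np.random.default_rng(0)
v = rng.standard_normal(768)
w = v + 0.1 * rng.standard_normal(768)  # a slightly perturbed copy

print(cosine_similarity(v, v))  # identical vectors score 1.0
print(cosine_similarity(v, w))  # near-duplicates score close to 1.0
```

In practice the vectors come from the model; everything downstream — search, classification, clustering — is built on this similarity.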
Semantic Search
Search across multilingual corpora. A Darija query retrieves relevant MSA, French, or English documents through a shared semantic space.
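Semantic search (and cross-lingual retrieval generally) is a ranking over cosine similarities. A sketch with made-up low-dimensional vectors standing in for Madmon embeddings — the document names and vector values are illustrative assumptions:

```python
import numpy as np

# Toy "embeddings" standing in for 768-d Madmon outputs.
docs = {
    "msa_article":   np.array([0.9, 0.1, 0.0]),
    "french_doc":    np.array([0.8, 0.2, 0.1]),
    "unrelated_doc": np.array([0.0, 0.1, 0.9]),
}

def search(query_vec: np.ndarray, corpus: dict, k: int = 2) -> list:
    """Return the k document ids most cosine-similar to the query embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(corpus, key=lambda name: cos(query_vec, corpus[name]),
                    reverse=True)
    return ranked[:k]

query = np.array([1.0, 0.0, 0.0])  # stand-in for a Darija query's embedding
print(search(query, docs))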
Content Classification
Classify Arabic dialect content, social media posts, and user-generated text with high-quality representations.
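One simple way to classify with embeddings is nearest-centroid: average the embeddings of labeled examples per class, then assign new text to the closest centroid. The labels and vectors below are illustrative assumptions, not model outputs:

```python
import numpy as np

def nearest_centroid(x: np.ndarray, centroids: dict) -> str:
    """Assign x to the label whose centroid embedding is most similar (cosine)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda label: cos(x, centroids[label]))

# Toy class centroids, e.g. averaged embeddings of labeled dialect posts.
centroids = {
    "darija":   np.array([1.0, 0.0]),
    "egyptian": np.array([0.0, 1.0]),
}
post = np.array([0.9, 0.2])  # stand-in embedding of a new post
print(nearest_centroid(post, centroids))
```

A linear classifier trained on the embeddings works the same way at scale; the centroid version just needs no training loop.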
Cross-Lingual Retrieval
Match content across scripts and languages. Arabizi, Arabic-script, and Latin-script content map to the same embedding space.
Clustering & Analytics
Group social media content by dialect, topic, or sentiment. Discover patterns in multilingual datasets.
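Grouping by dialect or topic amounts to clustering the embedding vectors; a minimal k-means sketch over toy 2-d points standing in for Madmon embeddings (the data is fabricated to show two obvious groups):

```python
import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    """Minimal k-means over embedding vectors; returns a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.array([int(np.argmin(np.linalg.norm(centers - p, axis=1)))
                           for p in points])
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Two clearly separated toy groups, e.g. posts on two different topics.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = kmeans(points, k=2)
print(labels)
```

With real 768-dimensional embeddings the same loop (or a library implementation) surfaces dialect, topic, or sentiment groupings without labels.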
Resources
Try Madmon now
Generate embeddings for any text. 768-dimensional vectors optimized for multilingual content.
Open Playground