Independent GenAI R&D Lab
AI for the languages the industry ignores
We build language models, datasets, and tools for underrepresented languages and cultures, starting with the 500 million Arabic speakers whose dialects remain invisible to frontier AI.
What we do
Three pillars of research, one mission
Low-Resource Languages
Pre-trained models, tokenizers, and datasets for 340+ underrepresented languages. No GPU required.
Agentic AI
End-to-end agentic LLM training pipelines, MCP tooling, and broad-coverage agent behaviors.
Cultural Alignment
Language models that reflect the values, norms, and linguistic realities of non-Western cultures.
Publications
Research
Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts
Lingua Posnaniensis, Vol. 67(1)
GenAI for Moroccan Darija: Challenges and Early Results
University of Navarra, Spain
Gherbal: A Multilingual Classifier for Low-Resource Languages
University Hassan II, Casablanca, Morocco
Open Source
Projects
Sawalni
The first LLM and AI assistant built specifically for Moroccan Darija, supporting both Arabic and Latin scripts. Trained from scratch on a custom corpus that captures Darija's linguistic and cultural subtleties. Served thousands of users in controlled and public deployments. Built by the Omneity Labs founder as a solo project in 2023.
wikilangs.org
NLP models for 340+ Wikipedia languages, no GPU required. An open playground enabling researchers, educators, and language communities to explore pre-trained NLP tools for languages with little to no commercial support.
Sawtone
An open framework for cross-script phonetic alignment and text normalization. Built to solve the pre-processing problem for alloglottographic and non-standardized languages such as Darija. Published in Lingua Posnaniensis.
wikipedia-monthly
Ready-to-use Wikipedia dumps for all 340+ language editions, updated monthly on Hugging Face. Used by leading AI labs, including Nous Research, as part of their training pipelines. The go-to dataset for researchers needing fresh, clean Wikipedia data at scale.
Stay connected
Reducing the digital divide, one language at a time
Get updates on our research, open-source releases, and upcoming publications.