Independent GenAI R&D Lab

AI for the languages the industry ignores

We build language models, datasets, and tools for underrepresented languages and cultures – starting with the 500 million Arabic speakers whose dialects remain invisible to frontier AI.

340+ Languages supported

1 Peer-reviewed publication

1st LLM for Moroccan Darija

What we do

Three pillars of research, one mission

🌍

Low-Resource Languages

Pre-trained models, tokenizers, and datasets for 340+ underrepresented languages. No GPU required.

🤖

Agentic AI

End-to-end agentic LLM training pipelines, MCP tooling, and broad-coverage agent behaviors.

🎭

Cultural Alignment

Language models that reflect the values, norms, and linguistic realities of non-Western cultures.

Publications

Research

paper

Sawtone: A Universal Framework for Phonetic Similarity and Alignment Across Languages and Scripts

Lingua Posnaniensis, Vol. 67(1)

2026-02

report

GenAI for Moroccan Darija: Challenges and Early Results

University of Navarra, Spain

2024-03

report

Gherbal: A Multilingual Classifier for Low-Resource Languages

University Hassan II, Casablanca, Morocco

2024

Open Source

Projects

Sawalni

The first LLM and AI assistant built specifically for Moroccan Darija, supporting both Arabic and Latin scripts. Trained from scratch on a custom corpus that captures Darija's linguistic and cultural subtleties, it has served thousands of users in controlled and public deployments. Built by the Omneity Labs founder as a solo project in 2023.

LLM, Moroccan Darija, Low-Resource Languages

wikilangs.org

NLP models for 340+ Wikipedia languages — no GPU required. An open playground enabling researchers, educators, and language communities to explore pre-trained NLP tools for languages with little to no commercial support.

NLP, Low-Resource Languages, Wikipedia

Sawtone

An open framework for cross-script phonetic alignment and text normalization. Built to solve the pre-processing problem for alloglottographic and non-standardized languages (like Darija). Published in Lingua Posnaniensis.

Phonetics, Transliteration, Cross-Script

wikipedia-monthly

Ready-to-use Wikipedia dumps for all 340+ language editions, updated monthly on Hugging Face. Used by leading AI labs, including Nous Research, as part of their training pipelines. The go-to dataset for researchers who need fresh, clean Wikipedia data at scale.

Datasets, Wikipedia, Open Source

Stay connected

Reducing the digital divide, one language at a time

Get updates on our research, open-source releases, and upcoming publications.

Get in Touch