Kamusi - The Global Online Living Dictionary

We draw on the collective knowledge of scholars and native speakers across the globe, with the moonshot mission to organize and make freely available as much knowledge of as many languages as can possibly be digitized. We believe that language is a bridge. Our goal is to build a "living" resource that constantly grows and adapts, ensuring that no language is left behind in the digital age. "Kamusi" is the Swahili word for dictionary, but we aim to be much more than a dictionary has ever been - we strive toward every word in every language, from human natural intelligence, not AI.

The Kamusi Project began at Yale University as the Internet Living Swahili Dictionary in 1994, in the same month as the first web browser, four years before Google, and seven years before Wikipedia. The original idea was to divide the work of creating a dictionary among the small community of Swahili speakers who were online at that time, and share the results for free. Our premise was that the Internet would eventually become widely available to students and the public in Africa, but the money for them to buy a modern print dictionary would not. Over the next few years, the dictionary grew to about 60,000 terms between Swahili and English, with about a million look-ups per month.

This early success attracted scholars of other African languages, who sought to join in. We soon realized that a multilingual dictionary was vastly more complicated than a bilingual dictionary, and required an entirely new lexicographic approach. Long story short, we developed the idea of "molecular lexicography", which further evolved into the Kam4D graph database that can organize complex elements of shape, sound, and meaning across languages, dialects, time, and space.

The goal has thus expanded: produce as much high-quality linguistic data for as many languages as possible, while remaining completely free to all. We have been working with the African Union to develop Kamusi as the data platform for the 2375 languages of Africa's 1.5 billion people, while keeping other under-resourced languages firmly within our vision. With this larger mission, Kamusi spun off in 2007 as an independent organization (technically two distinct legal entities, a 501(c)(3) non-profit in the US and a registered NGO in Switzerland).

We face two major challenges:

Our business model sucks. We aim to produce knowledge of the highest possible quality, with the direct engagement of speakers of a myriad of languages. This involves the development and management of complicated and secure participatory resources, which is expensive. Then we want to share all this knowledge for free, which is also expensive. Moreover, we want to do this for languages throughout Africa, and other non-lucrative languages, which wealthy countries and companies do not deem worthy of their attention or support.
AI. The rise of Artificial Intelligence is a disaster for the languages we most seek to support. Simply put, AI depends on the existence of digitized data, systematically provided by real human speakers of each language. For all but the world's wealthiest languages, only a tiny amount of data is available for even the most rudimentary computational purposes, if any at all. However, the world has been gaslit into imagining that AI will magically inhale language data from thin air. "LLM" stands for Large Language Model - and "Large" refers to huge amounts of data, not the minuscule amounts available for most languages. AI for African languages cannot happen and will not happen - unless and until we collect the real words of real people, and use that to train future LLMs. Kamusi is keen to grow that data, and to make it available for AI going forward, but we are hampered by the fantasies so many people have that the work has already been done, or will somehow happen without the hard work of data collection and digitization that went into languages like English and French during the past 70 years.

An additional problem with our servers and back end brought the site down for an extended time. Our project began with green screens and floppy disks and the primitive coding architecture of the day. As the back-end technologies evolved, we kept retrofitting ever-grander blueprints on ever-shifting foundations - for example, we moved our data from Word to Excel to MySQL to Drupal to Neo4j, as our needs expanded and better systems came into existence. Eventually we had too much old code conflicting with too many upgrades for too many back-end processes. Our server self-destructed, and our expertise in multilingual computational linguistics did not prove relevant to the sysadmin skills needed to get us back online. When it became clear that no amount of chewing gum and baling twine could fix thirty years of patchwork innovation, we decided our most effective path forward was to entirely rebuild and relaunch the system on the most modern technical frameworks.

Currently, we have revived services for several initial languages that serve various testing purposes in our path toward restoration. We have tons of data that we need to restore with care - we are not an AI vacuum cleaner that sucks in random words and spits back language-like slop. Thus, we are taking whatever time is necessary to give you the best possible access to the best language data we have waiting for you, and to collect ever better data for those languages and many more in the years to come.

Meet the Team

Dr. Martin Benjamin

Jérôme Bâton

Kelvin Kamau