Indigenous languages and Small Language Models: Creating Open Source Protocols for Community Toolkits

This event offers a meaningful space to advance work on underrepresented Indian languages, especially the revitalisation of Indigenous (Scheduled Tribe) and Adivasi languages, which are increasingly threatened by digital developments and diminishing populations. Many languages in India face shrinking speaker communities, which affects inter-generational knowledge transmission, dialect diversity, cultural knowledge, and the use of ‘Traditional Cultural Expressions and Ecological Knowledge’ embedded in everyday speech. As younger generations shift to dominant regional languages, there is an urgent need to create new, community-owned pathways for documentation, learning, and the governance or stewardship of digital assets.

The workshop’s focus on small, domain-specific language models aligns closely with larger objectives of democratising AI resources, and developing these technologies for economic growth and social empowerment. Large language models often flatten local meanings, merge dialects, and overlook culturally embedded semiotics because Indigenous languages are severely underrepresented in big data. For these languages to survive in digital spaces, the models must emerge from the communities themselves. This event provides a platform for exploring open source protocols that allow communities to steer how their linguistic materials are recorded, annotated, and used.

We see value in approaches that keep communities at the centre of designing methods, rather than reducing them to data providers. Small language models that are participatory, transparent, and adaptable will encourage more youth to engage in language documentation as a form of cultural work and technological practice. They allow communities to define what is sensitive, what can be shared, and what should remain local. This not only complements state-led efforts such as MLE and MTB, but also acknowledges that language learning happens outside the formal education system through cultural collectives, community-led field documentation, the creation and accumulation of source materials, and intergenerational knowledge exchange.

The hope is that those participating in the event will help strengthen methodologies for community-led language documentation, explore possible collaborations with technologists and open source groups, and ensure that underrepresented, low-resource, Indigenous languages can enter machine learning ecosystems in ways that respect local contexts, consent, and meaning. This will thus create the foundation for a community with a commitment to inclusion, cultural resilience, and ethical technology practices at the hyperlocal level.

Event description:

This hybrid event will bring together practitioners, researchers, community representatives and technologists to explore strategies and protocols for co-creating domain-specific small language models using open source AI tools. The event will have three tracks: technologies, community participation, and governance. Participants will present their research or areas of enquiry, as well as the challenges they have faced, in a workshop environment that will facilitate discussion, peer learning and resource sharing.

The event will demonstrate how to build capacity for AI interventions at the hyperlocal level in order to enhance social inclusion and democratise AI. It aims to create opportunities for developing suitable open source solutions, so that the benefits of AI are within reach even for those who might be marginalised by socio-economic and other factors such as gender, caste and language.

Desired outcomes:

To build a robust community of non-corporate actors in the space of developing Indigenous language models;
To create a toolkit that can be used by communities to ensure the legacy and longevity of their local languages and cultures;
To popularise the concept that LLMs are not the only meaningful use of AI.

Schedule

10:30 - 10:45 Opening comments, framing of the problem and the need for this meeting (Padmini Ray Murray, Design Beku)

10:45 - 12:00 Community activism (moderating: Subhashish Panigrahi)

Ramjit Tudu, Santali Language Activist
Ashish Birulee, Adivasi Lives Matter
Ganesh Birua, Ho Jagar
Akash Poyam, The Caravan
Rahi Soren, Jadavpur University
Ranjan Prasad & Faisal Rahman, Keystone Foundation

12:00 - 13:00 Discussion: Ethics and safety (moderating: Tarunima Prabhakar, Tattle)

13:00 - 14:00 Lunch

14:00 - 15:00 Demos (moderating: Sneha PP, OKI IIITH)

Computational Mama, Ajaibghar and Gooey.AI
Benu / Karthick, Unreal-Tec
Abhas, MostlyHarmless.io
TB Dinesh, Janastu

15:00 - 16:00 Discussion: Data Governance for Small LLMs (moderating: Namita Aavriti Malhotra, APC)