This is the first field diary of OpenSpeaks Archives, a digital language archive for low-resourced languages, following its second edition launch this July. We focus on community-based documentation of such languages and contribute to the Wikimedia projects, communities, and the broader Wikimedia movement. In this diary, we discuss our collaborations, activities, learning and future plans. Our current year-long project will demonstrate how we imagine oral history in low-resourced languages as a source of knowledge.

September 2025 — seven Fellows, first oral histories live, and Celtic Knot
By September, seven OpenSpeaks Fellows had confirmed participation across three regional clusters: northern India (languages: Marcha-Rongpo, Johari, Jaunpuri, Jaunsari, Bangani), eastern–southeastern India (Sora, Juray, Juang, Gorum/Parengi, Lambadi), and Nepal (Saptariya Tharu, Raji, Kumhali). Each Fellow is from one of the focus language communities, except one, a noted researcher from a neighbouring minoritised-language community.
Opino Gomango, a noted researcher of the Sora-Juray language cluster and our Fellow, began the first field documentation phase in Mysore, India, for Gorum (also known as Parengi; endangered, with no child speakers and an estimated 600–6,000 adult speakers) and Juang (vulnerable, about 30,000 speakers). A second documentation phase followed in Bhubaneswar in early 2026, with community review managed in parallel. Kimmi Pal (Fellow, Rongpo) finished subtitling most of the Marcha-Rongpo recordings, and one video was edited and published in December. Primary subtitles are now complete for all recordings.
In September, I virtually presented at Celtic Knot Conference 2025: “When Can We Cite Low-Resourced Language Oral Histories in Wikimedia Projects?” This was our early attempt to present the citation argument to a live Wikimedia audience. The conversations helped shape how we think about verifiability: the challenge is not that oral history is unreliable, but that the infrastructure to make it reliably citable does not yet exist. That framing became the backbone of the OpenSpeaks Oral History Framework, published in March 2026.
We also planned to build three prototype tools to address technical gaps identified in the pilot: a linear subtitle editor; a multimedia metadata viewer and compression helper; and a set of folder-organisation utilities. Tool ideas and early prototypes were shared with the Indic MediaWiki Developers User Group, and key members expressed interest in collaborating.
October 2025 — five tools, Commons uploads, Wikimedia Futures Lab selection
We compiled the documentation for all the tools on the OpenSpeaks/Tools page on Meta-Wiki, highlighting development status, specifications, and tech notes. Five prototype webapps were published or in final testing: Commons Metadata Generator, Media Folder Analyser, Multimedia Folder Organiser, Transcript Word Counter, and Print Subtitles (a tool to print subtitle files offline for translators to mark up by hand — a low-tech solution to a real field constraint).
The first batch of subtitled oral history videos from multiple languages was prepared for upload to Wikimedia Commons. We received confirmation of our participation in the Wikimedia Futures Lab in Frankfurt at the end of January 2026.
We also began a conversation with the Songhay language diaspora community to support mentorship and Wikipedia incubation, reaching outside South Asia for the first time.
November–December 2025 — oral histories on Wikimedia, new partnerships and GLAM collaborations
In November, we participated in Deutsche Welle Akademie’s “The next chapter: Journalism in the age of AI” in Chiang Mai and at FosterLang (Linguapax International), where we discussed language justice in the context of AI and equitable archiving work.
In December, we were invited to present at WikiConference Kerala 2025 on “Digital Tools and Strategy for Indigenous Languages” (recorded). Significant progress was made on Commons uploads: media in multiple languages were processed, subtitled, and embedded into Wikipedia and Wikidata entries. Two reusable Wikimedia Commons templates were created — {{OpenSpeaks}} for audio and video files and {{OpenSpeaks_image}} for slides and images. Both are available to any archivist working in this space.
We also started conversations with two GLAM institutions, one in India and one in Germany. The latter, a public archive, will permanently archive our language materials, assign DOIs, and make the oral history recordings citable on Wikipedia and other Wikimedia projects. The archive will also provide field linguist training for Wikimedians and co-create open educational resources (OER) on how archivists can contribute language data to their system and onward to Wikimedia. OpenSpeaks will act as a bridge, not a gatekeeper. Formal agreements will be signed in 2026.
We were invited to record a podcast episode with Radio Taiwan International as part of its Untranslatables Project series, hosted by Oleksandr Shyn.
January 2026 — Wikimedia Futures Lab, WikiVoice, and a paper accepted
At the Wikimedia Futures Lab in Frankfurt (30 January–1 February), we screened Gyani Maiya, a 2019 documentary about Kusunda, a language of Nepal that was then nearly extinct and has since been revived. We also discussed citing oral history in low-resourced languages with the attendees. Together with Igbo-language Wikimedian Tochi Precious and Biyanto Rebin from the ESEAP Hub, we co-built an alpha prototype of WikiVoice, a proposed platform for community archivists to host their media and generate citable, time-coded references, and collected input from other participants.
Earlier in January, we submitted a paper to Wiki Workshop 2026, co-authored with OpenSpeaks Fellows Opino Gomango and Kimmi Pal. The paper, “OpenSpeaks Archives: Citing Low-Resourced Language Oral History Multimedia”, argues that Wikipedia’s policies systematically exclude oral knowledge not because of verification concerns but because of epistemic hierarchies that privilege written, dominant-language sources. The three-parameter framework described in the paper (FAIR–CARE provenance, community review, multilingual transcription) became the foundation for the Oral History Framework released in March. We also moderated a panel on Indigenous languages and small language models at the AI Impact Summit 2026 pre-Summit. Media in three languages were processed, subtitled, and published on Commons, and embedded into Wikipedia articles and Wikidata items.
February 2026 — Wiki Loves Languages and WikiVoice
In February, we co-launched Wiki Loves Languages 2026 together with the Dagbani Wikimedians User Group, Igbo Wikimedians User Group, Odia Wikimedians User Group, and Wikimedians of Santali Language User Group as a cross-continental collaboration. WikiVoice continued to develop conceptually as a Futures Lab experiment; the model also maps directly onto what the OpenSpeaks Oral History Framework formalises.
March 2026 — the Framework, Wiki Workshop, Subtitler beta launch, and oral histories in practice
The OpenSpeaks Oral History Framework was released in full on Meta-Wiki. It is a practical, three-parameter guide for audio-visual documenters of low- and medium-resourced languages, built on the FAIR (Findable, Accessible, Interoperable, Reusable) and CARE (Collective Benefit, Authority to Control, Responsibility, Ethics) principles. The Framework emphasises community peer review, multilingual transcription, and time-coded subtitles; together, these enable independent verification and, ultimately, citation of oral history. A shorter version of the Framework was peer-reviewed at Wiki Workshop 2026 on 25–26 March. The Framework includes a remuneration matrix grounded in equity-centred participatory compensation, explicit guidance for each of the three documenter roles, and a practical subtitle convention aligned with the BBC Subtitle Guidelines and the EBU-TT-D standard.
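A subtitle convention of this kind lends itself to automated checking. The sketch below, a hedged illustration rather than the Framework's actual rules, validates one cue against two common readability limits; the specific numbers (37 characters per line, two lines per cue) are assumptions drawn loosely from BBC subtitling practice, not from the OpenSpeaks convention itself:

```python
# Illustrative limits only; the OpenSpeaks convention may differ.
MAX_CHARS_PER_LINE = 37   # roughly in line with BBC subtitling practice
MAX_LINES_PER_CUE = 2     # common cap for on-screen readability

def check_cue(text):
    """Return a list of warnings for one subtitle cue's text.

    An empty list means the cue passes these (assumed) limits.
    """
    warnings = []
    lines = text.split("\n")
    if len(lines) > MAX_LINES_PER_CUE:
        warnings.append(f"too many lines: {len(lines)} > {MAX_LINES_PER_CUE}")
    for i, line in enumerate(lines, 1):
        if len(line) > MAX_CHARS_PER_LINE:
            warnings.append(f"line {i} too long: {len(line)} chars")
    return warnings
```

Running such a check over every cue before upload gives community reviewers a quick, mechanical pass so that their attention can stay on transcription accuracy and meaning.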
What does this look like in practice? Two files on Wikimedia Commons illustrate the workflow end-to-end.

The first is a Juray-language recording of Manjula Bhuyan, a farmer, discussing how she uses Aadhaar, India’s biometric identity system, and the barriers she faces in accessing public information and welfare services. Opino Gomango, a Sora-speaking researcher and OpenSpeaks Fellow, and I conducted the interview. Juray is endangered, with approximately 25,000 speakers. The video includes time-coded subtitles, structured metadata using the {{OpenSpeaks}} template, and community review by Gomango, a second-language speaker. The interview was originally part of MarginalizedAadhaar, a 2021 documentary film, and the recording is now citable in Wikipedia articles in addition to being embeddable in them as media.

The second is a Marcha-Rongpo recording of Bimla and K.S. Badwal, a couple living in a city, remembering their childhoods in different villages. Rongpo is a threatened language from India with around 7,500 speakers. Kimmi Pal, a native Marcha speaker and OpenSpeaks Fellow, subtitled the raw recordings through a thorough community peer review together with her mother, Bhawani Pal. The file was one of the first to go through the full workflow: field recording → consent → community review → subtitling → Commons upload → Wikipedia integration.

On the tools side, the OpenSpeaks Subtitler beta is now live at subtitler.toolforge.org. It is being integrated into Wikimedia Commons, which means archivists can pull a video directly from Commons into the editor, create or improve subtitles, and submit them back to Commons without leaving the tool. It also works offline with local files, without Commons integration. Developers are invited to collaborate and contribute; the codebase is open. Community archivists are encouraged to use it and to report any issues or feature requests. The longer-term goal is AI-assisted speech-to-text.
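The citable, time-coded references that WikiVoice and the Framework describe can be sketched in a few lines. The helper below is hypothetical (the function names `fmt_tc` and `timecoded_reference` and the exact citation wording are our assumptions, not a published OpenSpeaks format); it shows the core idea of anchoring a citation to a span of an oral history recording on Commons:

```python
def fmt_tc(seconds):
    """Format a second count as an HH:MM:SS timecode."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def timecoded_reference(file_title, start, end, speaker, language):
    """Build a human-readable, time-coded reference for a Commons file.

    `file_title` is the Commons page title, e.g. "File:Example interview.webm".
    The citation wording here is illustrative only.
    """
    return (f"{speaker}, oral history in {language}, "
            f"Wikimedia Commons, {file_title}, "
            f"{fmt_tc(start)}-{fmt_tc(end)}")
```

Because the timecodes pin the claim to a specific span of the recording, a reviewer can verify it against the subtitled video directly, which is the verifiability infrastructure the Framework argues for.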
Can you help us translate this article?
To help this article reach as many people as possible, we would appreciate your help translating it.