Neel Shah
AI Tools · 7 min read · November 15, 2025

Sarvam AI: Building Multilingual AI for Diverse Patient Populations

Sarvam AI's focus on Indian languages and healthcare has lessons for any system serving linguistically diverse populations — and raises important questions about representation in health AI.

Tags: Sarvam AI · Multilingual · Healthcare AI · Inclusion · AI Tools
Neel Shah
Tech Lead · Senior Data Engineer · Ottawa, Canada

Most large language models speak English first and everything else second. For the majority of the world’s population — including significant portions of Canada’s healthcare-seeking population — this is a meaningful gap. Sarvam AI, an Indian AI lab building models specifically for Indian languages, is doing interesting work on a problem that matters beyond India.

I want to explore what their approach means for healthcare data systems serving diverse populations, and what data engineers building such systems should take from it.

What Sarvam Is Building

Sarvam AI has focused on building foundation models for Indian languages — Hindi, Bengali, Tamil, Telugu, Kannada, and others — with particular attention to healthcare and agricultural use cases. Their models are trained on domain-specific data in these languages, not just translated from English-language training data.

This distinction matters enormously. A model trained on translated English healthcare text learns English medical concepts expressed in another language. A model trained on original-language health data learns how people actually describe symptoms, treatments, and health experiences in their language and cultural context.

Why This Matters for Health Data

In my work with CIHI, I’m primarily dealing with administrative health data — structured records, diagnosis codes, procedure codes. The data is mostly numeric and coded. Language diversity is less directly relevant at the data pipeline level.

But health data systems are downstream of patient interactions, and patient interactions happen in many languages. A system that can only process symptom descriptions in English will produce systematically worse data quality for non-English speakers. This affects diagnoses coded, treatments recorded, and ultimately the population-level statistics my pipelines process.

When you work on national health data, these gaps compound. Systematic underrepresentation of non-English-speaking communities in health data isn’t just a fairness issue — it produces incorrect population-level statistics that policy and resource allocation are based on.

The Technical Lessons

Sarvam’s approach has several elements worth learning from:

1. Language-specific training data: They invested in collecting and curating health-domain training data in each target language, rather than translating existing English corpora. For teams building health AI in multilingual contexts, this is the right model — but it requires significant data collection effort that is often skipped.

2. Smaller, domain-specific models: Rather than building a massive general model, Sarvam has built more focused models for specific language and domain combinations. These are cheaper to run and often more accurate for their target use case. For healthcare applications where a large general model is overkill, this is worth considering.

3. Evaluation in context: They evaluate model performance on real health interactions in the target languages, not just on translated benchmark datasets. This seems obvious but is frequently not done.
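The evaluation point is easy to make concrete. A minimal sketch of why aggregate scores hide language gaps — the function names, label set, and the toy "English-only model" here are all illustrative, not Sarvam's actual evaluation harness:

```python
from collections import defaultdict

def per_language_accuracy(examples, predict):
    """Score predictions separately for each language.

    `examples` is an iterable of dicts with 'language', 'text', and
    'label' keys; `predict` is any callable mapping text -> label.
    Both are hypothetical stand-ins for a real evaluation setup.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["language"]] += 1
        if predict(ex["text"]) == ex["label"]:
            correct[ex["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy test set with in-language examples, not translations.
examples = [
    {"language": "en", "text": "chest pain", "label": "cardiac"},
    {"language": "en", "text": "headache", "label": "neuro"},
    {"language": "fr", "text": "douleur thoracique", "label": "cardiac"},
    {"language": "fr", "text": "mal de tête", "label": "neuro"},
]

# A "model" that only understands English phrasing.
def english_only(text):
    return {"chest pain": "cardiac", "headache": "neuro"}.get(text, "unknown")

scores = per_language_accuracy(examples, english_only)
# scores["en"] is 1.0 while scores["fr"] is 0.0 — the 50% aggregate
# accuracy would hide which population the model is failing.
```

Splitting the report by language is the whole trick: a single benchmark number, especially one computed on translated test data, cannot reveal this failure mode.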

What This Means in the Canadian Context

Canada’s linguistic diversity is different from India’s — primarily English and French at the national level, with significant numbers of Indigenous language speakers and large immigrant communities speaking dozens of other languages.

The French-English gap in Canadian health AI is real and underappreciated. Much health AI development happens in English-first contexts, and French-language validation is often an afterthought. For federal health systems serving all Canadians, this is a problem.

For Indigenous language speakers, the gap is even larger. There are essentially no production-quality health AI tools that operate in Cree, Ojibwe, Inuktitut, or the other major Indigenous languages of Canada. This contributes to — and is caused by — systematic underrepresentation of Indigenous communities in health data systems.

Practical Steps

For data engineers building health systems in Canada:

  1. Validate your NLP components in French, not just English. This should be a standard requirement.
  2. Document language coverage as part of your system’s data quality metadata. Know what percentage of records were originally captured in non-primary languages and what processing was applied.
  3. Involve community partners when building systems that will affect specific linguistic communities. Technical competence is not sufficient on its own.
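Step 2 above can be sketched as a small metadata computation. The record schema and language codes here are illustrative assumptions, not a CIHI schema:

```python
from collections import Counter

def language_coverage(records, primary="en"):
    """Summarize what share of records was originally captured
    in each language, as data-quality metadata for a dataset.

    `records` is an iterable of dicts with a 'language' field
    (ISO 639-1-style codes here); both are illustrative.
    """
    counts = Counter(r.get("language", "unknown") for r in records)
    n = sum(counts.values())
    return {
        "total_records": n,
        "by_language": {lang: c / n for lang, c in counts.items()},
        "non_primary_share": sum(
            c for lang, c in counts.items() if lang != primary
        ) / n,
    }

# Toy dataset: 7 English, 2 French, 1 Inuktitut record.
records = [{"language": "en"}] * 7 + [{"language": "fr"}] * 2 + [{"language": "iu"}]
meta = language_coverage(records)
# meta["non_primary_share"] is 0.3 — 30% of records were captured
# in a non-primary language and may have passed through translation
# or English-only NLP, which is exactly what downstream consumers
# need to know about.
```

Publishing a number like `non_primary_share` alongside the dataset makes the language gap visible to everyone downstream, rather than leaving it buried in the pipeline.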

Sarvam’s work is a reminder that building for the full range of people a health system serves requires deliberate investment, not just a general-purpose model. The technical infrastructure of health AI should reflect the linguistic and cultural diversity of the population it serves.