At CIHI, I lead the engineering of a PySpark pipeline that processes the entirety of Canada’s national health data — over 1 billion data points spanning registry, diagnosis, procedure, and pharmaceutical records from every hospital and health authority in the country. This post is about the engineering decisions that made that scale tractable, and the lessons learned along the way.
I’m sharing what I can while keeping specific implementation details confidential. The architectural patterns are general.
The Scale Problem
Population-scale health data has characteristics that distinguish it from typical enterprise data engineering:
- Record volume: 1B+ rows across multiple linked datasets
- Complex joins: Patient records must be linked across data sources without common unique identifiers (health card numbers are jurisdiction-specific, not universal)
- Schema complexity: 200+ parameters per record, many optional, with significant variation between jurisdictions
- Temporal dimension: Multi-year longitudinal data requiring time-series analysis
- Strict compliance: Every transformation must be auditable, and data lineage must be traceable
The naive approach — load everything into memory and transform — doesn’t work. You need to design the pipeline around the constraints from the start.
Architecture Decisions
Partitioning Strategy
The most important architectural decision was partitioning. We partition by province and fiscal year, which matches the natural access patterns (most queries are province-specific or time-bounded) and keeps partition sizes manageable.
# Partition design: province + fiscal_year
df = spark.read.parquet("s3://health-data/registry/")
df_partitioned = df.repartition(
    "province_code",
    "fiscal_year"
).sortWithinPartitions("patient_id_hash")
The sortWithinPartitions on patient ID is critical for the downstream join operations — it dramatically improves join performance when linking records within the same province-year partition.
Avoiding Full Shuffles
Full shuffles on billion-row datasets are catastrophically expensive. We designed every transformation to avoid them where possible:
- Broadcast joins for small lookup tables: Diagnosis code descriptions, procedure classifications, and institutional lookups are all small enough to broadcast.
- Bucketing for large-to-large joins: When joining two large datasets (claims to beneficiary records), we pre-bucket both on patient ID hash so the join is partition-local.
- Window functions over sort-merge joins: Many aggregations that look like joins can be expressed as window functions, which Spark executes more efficiently.
Incremental Processing
Processing the full dataset from scratch on every run is not feasible — it would take hours and is unnecessary for most use cases. We implemented incremental processing using Delta Lake:
- New fiscal year data is appended to the existing Delta table
- Changed records trigger a merge (upsert) operation
- Historical data is versioned, allowing point-in-time queries for audit purposes
Delta Lake’s ACID transactions are essential here — partial writes in a production health data system are not acceptable.
The Compliance Engineering
Every pipeline step that touches PII must be logged. We implemented this as a decorator pattern:
import functools
from datetime import datetime, timezone

def audit_transform(transform_name: str, pii_fields: list):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df, *args, **kwargs):
            # Log before the transform runs. Note that df.count()
            # triggers a Spark job, so this is not free.
            log_pipeline_operation(
                operation=transform_name,
                pii_fields_accessed=pii_fields,
                record_count=df.count(),
                timestamp=datetime.now(timezone.utc),
            )
            result = func(df, *args, **kwargs)
            log_pipeline_completion(
                operation=transform_name,
                output_record_count=result.count(),
            )
            return result
        return wrapper
    return decorator

@audit_transform("patient_linkage", pii_fields=["patient_id_hash", "dob_year"])
def link_patient_records(df_claims, df_registry):
    # ... linkage logic
    pass
This gives us a complete audit trail of every operation that touches PII fields, without embedding compliance logic in the business logic of each transformation.
Performance: Getting Under 60 Minutes
The target was to complete the full CMG Grouper pipeline run — processing up to 24 million records with 200+ parameters — in under 60 minutes. Getting there required several rounds of optimisation.
Key optimisations:
- Predicate pushdown and column pruning: Ensure Spark pushes filters down to the file scan level, and select only the needed columns as early as possible. Column pruning alone reduced data read volume by ~40%.
- Caching intermediate results: The pipeline has several stages that reuse the same intermediate DataFrames. Caching these in memory (with spill to disk) eliminated redundant recomputation.
- Executor configuration tuning: The default Databricks configuration wasn’t optimised for our access pattern. Tuning executor memory, parallelism, and shuffle partition count based on profiling runs reduced total time by ~25%.
- Eliminating Python UDFs: The original code used Python UDFs for some complex transformations. Rewriting these as native PySpark/SQL expressions (or, where necessary, Pandas UDFs) substantially reduced serialisation overhead.
What I’d Do Differently
Start with Delta Lake: We migrated to Delta Lake midway through the project. Starting with it from day one would have saved significant refactoring time.
Design the audit schema first: The audit logging requirements became clear late in development, requiring retrofitting. Designing the audit data model upfront — what needs to be logged, in what format, for what queries — should be a first-day activity.
Invest in test data generation earlier: Building realistic synthetic datasets for unit testing is time-consuming but essential. We underinvested here initially, which slowed development velocity on later pipeline stages.
Takeaway
Processing national-scale health data with PySpark is an engineering challenge, not just a configuration challenge. The patterns that work — careful partitioning, avoiding shuffles, incremental processing, audit-by-design — are general principles that apply to any large-scale data system. But at health data scale, getting them wrong has consequences beyond slow queries.
The 60-minute run time isn’t just a performance target. For the clinicians and policymakers who use the outputs, it’s the difference between data that can inform decisions and data that arrives too late to matter.