Large-scale data systems for high-stakes organisations
Senior Data Engineer and Technical Consultant specialising in PySpark, Python, and cloud infrastructure for Government, Healthcare, and Financial Services. Based in Ottawa, Canada.
I'm Neel Shah — a Senior Data Engineer and Technical Consultant with 10+ years building end-to-end data systems for organisations that handle sensitive, high-volume data. My core expertise is PySpark, Python, and cloud infrastructure — delivering reliable, compliant, and scalable pipelines across Government, Healthcare, and Financial Services.
As Tech Lead at CIHI (Canadian Institute for Health Information), I lead large-scale PySpark pipelines processing over 1 billion Canadian health data points — covering national registry, diagnosis, and pharma datasets — for federal government, provincial governments, and NPO clients. I manage client relationships, lead the engineering team end-to-end, and ensure strict compliance with PIPEDA and provincial health privacy legislation. This means PII governance, audit trails, and data security are part of every technical decision I make.
Before CIHI, at EXL Service (embedded at Goldman Sachs), I built PySpark-based credit risk management platforms handling 1 million financial transactions per hour, powering Apple Card, Walmart Card, and GM Card risk decisioning with full regulatory compliance. I also architected cloud-based systems at Canopy Growth, Manulife, and SITA, building high-availability infrastructure across Azure and AWS for millions of users.
I use modern AI tools — Claude, GPT, and local LLMs — as productivity accelerators: faster code reviews, automated documentation, data validation pipelines, and intelligent development workflows. AI makes my engineering output faster and higher quality — it's a tool, not the product.
I also created emot — an early open source contribution that grew to 3M+ downloads. It's a reminder that the best tools solve one thing really well.
Originally from Vadodara, India, where I graduated 1st in my engineering class, I moved to Canada for graduate studies and have since contributed to both the tech community and volunteer AI initiatives.
- 📍 Ottawa, Ontario, Canada
- 🏢 CIHI (current)
- 🎓 Lakehead University
- 💻 10+ years experience
- 📄 3 research papers · 89+ citations
- 📦 3M+ open source downloads
- 🌍 5 languages
- English: Native
- Hindi: Native
- Gujarati: Native
- French: Elementary
- Sanskrit: Limited
Technical Skills
PySpark & Big Data
PII & Compliance
Cloud & Infrastructure
AI-Accelerated Development
Additional Skills
Experience
Lead a large-scale PySpark ETL pipeline that processes up to 24M records with 200+ parameters in under 60 minutes, ingesting 1B+ Canadian health data points (registry, diagnosis, pharma) from every hospital in the country. Serve federal/provincial government and NPO clients, lead a cross-functional engineering team end-to-end through the SDLC, manage client relationships, and enforce PIPEDA and provincial health privacy compliance.
Built PySpark-based credit risk management platform handling Apple Card, Walmart Card, and GM Card portfolios at 1M transactions/hour. Resolved P-0 production incidents saving $10M+ in risk exposure. Built Python test automation framework reducing end-to-end test time by 60%.
Architected and led a full waterfall-to-agile transformation across 42 company websites. Designed Python/FastAPI/Docker/AWS microservice handling 100K requests/hour. Delivered $5M/year in cost savings through AEM virtualisation. Built WCAG accessibility compliance tooling across the full digital estate.
Built and maintained Azure cloud infrastructure of 1,800+ servers (Windows & Linux) with 99.99% uptime SLA. Developed real-time Power BI dashboards for infrastructure monitoring. Automated CI/CD pipeline with Python and Docker, reducing debugging time by 45 minutes.
Designed real-time airport analytical system integrating LiDAR and camera hardware using Python and reactive programming. Led Python 2→3 migration of large-scale airport systems. Transformed monolithic legacy architecture into cloud-based microservices on Azure.
Published 3 peer-reviewed papers (89+ citations, NSERC-funded). Built 20-node Elasticsearch cluster searching 330M tweets/second for real-time public health analytics. Developed Random Forest NLP model achieving 93.4% accuracy for population-level health classification.
Built asynchronous chatbot analytics API handling thousands of requests/second. Developed 5+ real-time AWS dashboards for semantic analysis and topic extraction. Designed clustering algorithm for chat-based decision support.
Built real-time data analysis system for product cost and logistics using SAP and Python. Developed time-series sales forecasting model achieving 71% prediction efficiency. Designed ETL and report generation pipeline for sales, cost, and inventory data.
Education
Beyond Work
Cycling
Ottawa's cycling paths are underrated. Long rides clear the head after a week of debugging PySpark DAGs.
Walking
Walking is where problems get solved. Some of the best architecture decisions happen away from the screen.
Gym
Consistency at the gym mirrors consistency in engineering. Show up, do the work, trust the compound effect.