Neel Shah
Writing

Technical blog on data engineering & AI tools

Practical writing on PySpark at scale, PII compliance, cloud data architecture, and using modern AI tools to build faster.

10 posts · PySpark · AI Tools · Healthcare & Finance
Series

PySpark in Production

Oct 15, 2025 11 min read

Processing 1 Billion+ Health Records with PySpark: Architecture and Lessons

A deep technical walkthrough of the architecture decisions, partitioning strategies, and hard-won lessons from building a PySpark pipeline that processes Canada's national health dataset at population scale.

PySpark, Healthcare, Big Data
Sep 15, 2025 10 min read

PII Compliance at Scale: PIPEDA and Health Data Privacy with PySpark

A practical engineering guide to building PySpark pipelines that handle Personal Health Information in compliance with PIPEDA and provincial health privacy legislation in Canada.

PySpark, PIPEDA, PII
Aug 15, 2025 10 min read

Real-time Credit Risk Monitoring with PySpark Streaming in Financial Services

How we built a PySpark Structured Streaming pipeline to process 1 million financial transactions per hour for credit risk monitoring across Apple Card, Walmart Card, and GM Card portfolios.

PySpark, Streaming, Credit Risk
Jul 15, 2025 9 min read

Databricks + PySpark for Government Health Data: Architecture Considerations

What changes when your PySpark workloads run on Databricks for a government health client — governance requirements, Unity Catalog, workspace isolation, and the procurement reality.

Databricks, PySpark, Government
Jun 15, 2025 10 min read

Optimising PySpark Jobs for Large-Scale Financial Transaction Processing

Performance tuning patterns for PySpark pipelines handling hundreds of millions of financial transactions — partitioning strategy, join optimisation, and the profiling workflow that surfaces real bottlenecks.

PySpark, Performance, Financial Services
Series

AI Tools for Data Engineers

Mar 15, 2026 8 min read

How I Use Claude to Accelerate Enterprise Data Engineering

A practical look at how Claude fits into real data engineering workflows — from PySpark code review to PII audit documentation — and why it's become an indispensable productivity tool.

Claude, Data Engineering, PySpark
Feb 15, 2026 7 min read

Gemini for Data Analysis: A Practical Review for Enterprise Teams

After testing Google Gemini on real data engineering tasks — schema inference, SQL generation, and multi-modal data documentation — here's an honest assessment of where it fits in an enterprise stack.

Gemini, Google AI, Data Analysis
Jan 15, 2026 9 min read

Integrating OpenAI into Healthcare Data Pipelines: Lessons Learned

Real lessons from integrating GPT-4 into health data workflows — what works, what requires careful safeguards, and the compliance questions you need to answer before you start.

OpenAI, GPT-4, Healthcare
Dec 15, 2025 6 min read

Perplexity AI as a Research Tool for Compliance and Regulatory Work

Perplexity's cited, real-time search makes it surprisingly useful for staying current on Canadian health privacy regulations, PIPEDA updates, and financial compliance requirements.

Perplexity, Compliance, PIPEDA
Nov 15, 2025 7 min read

Sarvam AI: Building Multilingual AI for Diverse Patient Populations

Sarvam AI's focus on Indian languages and healthcare has lessons for any system serving linguistically diverse populations — and raises important questions about representation in health AI.

Sarvam AI, Multilingual, Healthcare AI