One searchable collection
Old AWS and BigData posts plus newer AI/data systems essays share the same archive.
Field notes from building reliable data systems: distributed processing, streaming, model-serving patterns, AWS-era BigData lessons, and the engineering judgment behind production AI.
The old WordPress writing has been converted into Markdown so it can live beside new posts without a database server. That means source-controlled writing, static hosting, and a UI editor when you want one.
Legacy images and attachments are copied locally and old upload links are rewritten.
Decap CMS gives you a browser editor while keeping posts as plain Markdown files.
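As a rough illustration of the upload-link rewriting mentioned above, here is a minimal sketch. The old host name `old-blog.example`, the `/wp-content/uploads/` path, and the local `/uploads/` prefix are all assumptions made for the example; the actual migration may use different paths and tooling.

```python
import re

# Hypothetical old upload URL prefix and local target; neither is taken from the site itself.
OLD_UPLOADS = re.compile(r"https?://(?:www\.)?old-blog\.example/wp-content/uploads/")
LOCAL_PREFIX = "/uploads/"


def rewrite_upload_links(markdown: str) -> str:
    """Point old absolute upload URLs at locally copied files."""
    return OLD_UPLOADS.sub(LOCAL_PREFIX, markdown)


if __name__ == "__main__":
    post = "![diagram](http://old-blog.example/wp-content/uploads/2015/03/emr.png)"
    print(rewrite_upload_links(post))
```

Only the URL prefix is replaced, so the year/month folder layout of the copied files stays intact.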
Recent writing is organized around the practical overlap of AI, distributed systems, data engineering, reliability, and operational clarity.
A practical bridge from big data escalation work to AI platform reliability: the same habits show up in scheduling, observability, data quality, and recovery.
Orchestration helps only after the system can explain itself. Metrics, traces, logs, and workload shape need to come before you can trust automation.
Data pipelines need reliability language too. Freshness, completeness, latency, correctness, and recoverability are better signals than green checkmarks.
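To make two of those signals concrete, the sketch below computes freshness lag and completeness for a single batch. The `BatchStats` fields and the default thresholds (30 minutes, 99%) are hypothetical and chosen only for illustration; real pipelines would derive them from their own service-level objectives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class BatchStats:
    # Hypothetical per-batch summary; field names are illustrative.
    last_event_time: datetime
    rows_received: int
    rows_expected: int


def completeness(stats: BatchStats) -> float:
    """Fraction of expected rows that arrived, capped at 1.0."""
    if stats.rows_expected <= 0:
        return 1.0
    return min(stats.rows_received / stats.rows_expected, 1.0)


def evaluate(
    stats: BatchStats,
    now: datetime,
    max_lag: timedelta = timedelta(minutes=30),  # assumed threshold
    min_completeness: float = 0.99,  # assumed threshold
) -> dict:
    """Report freshness and completeness instead of a bare pass/fail."""
    lag = now - stats.last_event_time
    ratio = completeness(stats)
    return {
        "fresh": lag <= max_lag,
        "complete": ratio >= min_completeness,
        "lag_seconds": lag.total_seconds(),
        "completeness": ratio,
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stats = BatchStats(now - timedelta(minutes=45), 950, 1000)
    print(evaluate(stats, now))
```

A job like this one can finish "green" while still being stale and incomplete, which is the gap these signals are meant to expose.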
The expensive part of an AI platform is not only the accelerator. It is the end-to-end path that keeps training and inference workloads fed, observable, and recoverable.
Some Spark incidents do not begin with a clean error. They begin as shape changes: skew, executor churn, storage pressure, and queues that stop draining.
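One way to notice a shape change like skew before any error appears is to compare the largest partition against the median. The sketch below is plain Python over a list of partition sizes; how those sizes are collected (for example from stage metrics) is left out, and the ratio is an illustrative heuristic, not a Spark API.

```python
from statistics import median


def skew_ratio(partition_sizes: list[int]) -> float:
    """Ratio of the largest partition to the median partition size."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    med = median(partition_sizes)
    largest = max(partition_sizes)
    if med == 0:
        # All-empty partitions are treated as balanced; otherwise skew is unbounded.
        return float("inf") if largest > 0 else 1.0
    return largest / med


if __name__ == "__main__":
    print(skew_ratio([100, 100, 100, 900]))
```

A ratio near 1 suggests evenly sized partitions; a ratio that climbs between runs is the kind of shape change worth investigating before tasks start failing.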
A senior escalation playbook should reduce uncertainty: who is affected, which layer is failing, what changed, and what the safest next move is.
The site is positioned as a technical notebook, not a job pitch: the core themes are durable systems, data movement, serving paths, and the reality of operating at scale.