AI systems, data platforms, and BigData notes.

Field notes from building reliable data systems: distributed processing, streaming, model-serving patterns, AWS-era BigData lessons, and the engineering judgment behind production AI.

Archive, now usable.

The old WordPress writing has been converted into Markdown so it can live beside new posts without a database server. That means source-controlled writing, static hosting, and a UI editor when you want one.

One searchable collection

Old AWS and BigData posts plus newer AI/data systems essays share the same archive.

WordPress content preserved

Legacy images and attachments are copied locally, and old upload links are rewritten to point at the local copies.

Write through a UI

Decap CMS gives you a browser editor while keeping posts as plain Markdown files.

Latest notes.

Recent writing is organized around the practical overlap of AI, distributed systems, data engineering, reliability, and operational clarity.

From EMR Escalations to AI Platform Reliability

A practical bridge from big data escalation work to AI platform reliability: the same habits show up in scheduling, observability, data quality, and recovery.

ML Infrastructure, Amazon EMR, Reliability

Observability Before Orchestration

Orchestration helps only after the system can explain itself. Metrics, traces, logs, and workload shape need to come before automation confidence.

Observability, Kubernetes, SRE

SLO Thinking for Data Pipelines That Feed AI Systems

Data pipelines need reliability language too. Freshness, completeness, latency, correctness, and recoverability are better signals than green checkmarks.

SLOs, Data Quality, AI Systems

The GPU Cluster Has a Data Problem First

The accelerator is not the only expensive part of an AI platform. The end-to-end path that keeps training and inference workloads fed, observable, and recoverable costs just as much attention.

SageMaker, Data Pipelines, Scheduling

When Spark Fails Quietly Before the Logs Agree

Some Spark incidents do not begin with a clean error. They begin as shape changes: skew, executor churn, storage pressure, and queues that stop draining.

Spark, Amazon EMR, YARN

EMR Escalation Playbooks and Enterprise Blast Radius

A senior escalation playbook should reduce uncertainty: who is affected, which layer is failing, what changed, and what the safest next move is.

Amazon EMR, Operations, Enterprise Support

Recurring threads.

The site is a technical notebook, not a job pitch. Its core themes are durable systems, data movement, serving paths, and the reality of operating at scale.