Notes | Built at Scale

May 2026

AI Infrastructure

From EMR Escalations to AI Platform Reliability

A practical bridge from big data escalation work to AI platform reliability: the same habits show up in scheduling, observability, data quality, and recovery.

Mar 2026

Reliability

Observability Before Orchestration

Orchestration helps only after the system can explain itself. Metrics, traces, logs, and workload shape need to come before automation confidence.

Jan 2026

Cloud Data

SLO Thinking for Data Pipelines That Feed AI Systems

Data pipelines need reliability language too. Freshness, completeness, latency, correctness, and recoverability are better signals than green checkmarks.

Nov 2025

AI Infrastructure

The GPU Cluster Has a Data Problem First

The expensive part of an AI platform is not only the accelerator. It is the end-to-end path that keeps training and inference workloads fed, observable, and recoverable.

Aug 2025

EMR and Spark

When Spark Fails Quietly Before the Logs Agree

Some Spark incidents do not begin with a clean error. They begin as shape changes: skew, executor churn, storage pressure, and queues that stop draining.

Mar 2025

Reliability

EMR Escalation Playbooks and Enterprise Blast Radius

A senior escalation playbook should reduce uncertainty: who is affected, which layer is failing, what changed, and what the safest next move is.

Dec 2024

Distributed Systems

What Dynamo-Style Systems Teach AI Infrastructure Teams

AI infrastructure inherits old distributed systems tradeoffs: partitioning, coordination, consistency, backpressure, and the operational cost of being almost right.

May 2024

Streaming

Kinesis, Kafka, and the Shape of Streaming Incidents

Streaming incidents usually have a shape: producer pressure, partition imbalance, consumer lag, checkpoint pain, or downstream systems that cannot keep up.

Oct 2023

Cloud Data

Glue, Athena, and the Hidden Cost of Convenience

Serverless data tools remove servers from the interface, not from the system. Layout, metadata, file size, and query shape still decide the bill and the user experience.

Mar 2023

Distributed Systems

HBase, YARN, and the Art of Debugging Stateful Systems

Stateful systems turn small assumptions into operational puzzles. The fastest path is to separate scheduler symptoms from storage symptoms early.

Aug 2022

Networks

What Router Data Planes Taught Me About Big Data Systems

Before big data escalations, there were routers: throughput limits, capacity enforcement, UNIX tooling, and the habit of respecting the data plane.

Dec 2015

DynamoDB

Search DynamoDB tables using Elasticsearch/Kibana via Logstash plugin

The Logstash plugin for Amazon DynamoDB gives you a nearly real-time view of the data in your DynamoDB table. The Logstash plugin for DynamoDB uses DynamoDB Streams to parse and output data as it is added to a DynamoD...

Dec 2015

Data Pipelines

All about AWS Data Pipeline Task Runner

How does Data Pipeline install Task Runner on an EC2 instance? Data Pipeline launches an EC2 instance on your behalf with the following user-data script. ------------------------------------------------- #!/bin/bash se...

Dec 2015

Elasticsearch

Query an AWS ES cluster by signing HTTP requests with AWS IAM roles (Python)

The public-facing AWS documentation provides some Python examples for signing HTTP requests with IAM users' credentials to access other AWS resources. In this case, AWS ES cluster whose access policies are restricted to those IAM...

Nov 2015

Data-sets

UK police data

data.police.uk provides a complete snapshot of crime, outcome, and stop and search data, as held by the Home Office at a particular point in history. The actual data is located on S3 under bucket policeuk-data and can...

Nov 2015

EMR || Elastic MapReduce

YARN log aggregation on an EMR cluster: how to?

Why log aggregation? User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application: correctness problems with the application itself, ra...

Nov 2015

DynamoDB

Using DynamoDB as session provider with AWS SDK V3

The DynamoDB Session Handler is a custom session handler for PHP that allows developers to use Amazon DynamoDB as a session store. Using DynamoDB for session storage alleviates issues that occur with session handling ...

Sep 2015

EMR || Elastic MapReduce

HBase snapshot / export

I observed that exporting large HBase tables with the HBase-provided 'Export' utility is heavily CPU-bound. If you are using default cluster configurations, the mappers may consume 100% CPU and may crash the regionServe...

Sep 2015

EMR || Elastic MapReduce

Installing and using Apache Sqoop to export/import large datasets (MySQL, S3) (CSV/TSV..) on an EMR cluster

Export/import activity is often limited by several performance bottlenecks, so it may be faster to use a distributed transfer instead of a normal transfer. Some of the bottlenecks include Re...

Sep 2015

EMR || Elastic MapReduce

How to retrieve Cluster ID / JobFlow ID from EMR master node.

Look in /mnt/var/lib/info/ on the master node to find a lot of information about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId or ClusterID. I was able to install a JSON p...

Aug 2015

Data Pipelines

Incremental Load: avoiding data loss

When copying data from RDS to Redshift, start the 'Incremental copy template' before the 'Full copy' to avoid data loss. A sample implementation can be, ------------------------------------------------- Incremental ...

Jun 2015

S3

Finding the size of an AWS S3 bucket/object

aws s3 ls s3://bucket/folder --recursive | awk 'BEGIN {total=0}{total+=$3}END{print total/1024/1024" MB"}'

Jun 2015

Data Pipelines

Export Hive metastore to S3 on a schedule.

Jun 2015

Demos

AML Demo

Using Amazon ML to Predict Responses to a Marketing Offer: With Amazon Machine Learning (Amazon ML), you can build and train predictive applications and host your applications in a scalable cloud solution. In this tuto...

Jun 2015

Architectures

Online Game

A cost-effective online game architecture featuring automatic capacity adjustment, a highly available and high-speed database, and a data processing cluster for player behavior analysis.

Jun 2015

Architectures

Financial Services Grid Computing

Financial services grid computing on the cloud provides dynamic scalability and elasticity for operation when compute jobs are required, and utilizing services for aggregation that simplify the development of grid sof...

Jun 2015

Architectures

Time Series Processing

When data arrives as a succession of regular measurements, it is known as time series information. Processing of time series information poses systems scaling challenges that the elasticity of AWS services is uniquely...

Jun 2015

Architectures