35 notes

The archive.

Search the old WordPress posts and the new AI/data systems notes from one place.

May 2026
AI Infrastructure

From EMR Escalations to AI Platform Reliability

A practical bridge from big data escalation work to AI platform reliability: the same habits show up in scheduling, observability, data quality, and recovery.

ML Infrastructure, Amazon EMR, Reliability, SLOs
Mar 2026
Reliability

Observability Before Orchestration

Orchestration helps only after the system can explain itself. Metrics, traces, logs, and workload shape need to come before automation confidence.

Observability, Kubernetes, SRE, Data Pipelines
Jan 2026
Cloud Data

SLO Thinking for Data Pipelines That Feed AI Systems

Data pipelines need reliability language too. Freshness, completeness, latency, correctness, and recoverability are better signals than green checkmarks.

SLOs, Data Quality, AI Systems, Pipelines
Nov 2025
AI Infrastructure

The GPU Cluster Has a Data Problem First

The expensive part of an AI platform is not only the accelerator. It is the end-to-end path that keeps training and inference workloads fed, observable, and recoverable.

SageMaker, Data Pipelines, Scheduling, Reliability
Aug 2025
EMR and Spark

When Spark Fails Quietly Before the Logs Agree

Some Spark incidents do not begin with a clean error. They begin as shape changes: skew, executor churn, storage pressure, and queues that stop draining.

Spark, Amazon EMR, YARN, Escalations
Mar 2025
Reliability

EMR Escalation Playbooks and Enterprise Blast Radius

A senior escalation playbook should reduce uncertainty: who is affected, which layer is failing, what changed, and what the safest next move is.

Amazon EMR, Operations, Enterprise Support, Runbooks
Dec 2024
Distributed Systems

What Dynamo-Style Systems Teach AI Infrastructure Teams

AI infrastructure inherits old distributed systems tradeoffs: partitioning, coordination, consistency, backpressure, and the operational cost of being almost right.

DynamoDB, Distributed Storage, Consistency, AI Platforms
Oct 2023
Cloud Data

Glue, Athena, and the Hidden Cost of Convenience

Serverless data tools remove servers from the interface, not from the system. Layout, metadata, file size, and query shape still decide the bill and the user experience.

AWS Glue, Athena, Metadata, Cost
Dec 2015
Data Pipelines

All about AWS Data-Pipelines Taskrunner

How does Data Pipeline install Task Runner on an EC2 instance? Data Pipeline launches an EC2 instance on your behalf with the following user-data script: #!/bin/bash se...

aws, pipelines, data, taskrunner
Nov 2015
Data-sets

UK police data

data.police.uk provides a complete snapshot of crime, outcome, and stop and search data, as held by the Home Office at a particular point in history. The actual data is located on S3 under bucket policeuk-data and can...

uk, police, data, street
Nov 2015
EMR || Elastic Map Reduce

YARN Log aggregation on EMR Cluster - How to ?

Why log aggregation? User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, ra...

log, hadoop, aggregation, log-aggregation
Nov 2015
Dynamo DB

Using DynamoDB as session provider with AWS SDK V3

The DynamoDB Session Handler is a custom session handler for PHP that allows developers to use Amazon DynamoDB as a session store. Using DynamoDB for session storage alleviates issues that occur with session handling ...

DynamoDB
Sep 2015
EMR || Elastic Map Reduce

hbase snapshot / export

I observed that exporting large HBase tables with the HBase-provided 'Export' utility is heavily CPU-bound. If you are using default cluster configurations, the mappers may consume 100% CPU and may crash the regionServe...

emr, hbase, export, copy
Sep 2015
EMR || Elastic Map Reduce

How to retrieve Cluster ID / JobFlow ID from EMR master node.

Look in /mnt/var/lib/info/ on the master node to find a lot of information about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId (the cluster ID). I was able to install a JSON p...
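
As a rough illustration of the lookup this excerpt describes, the sketch below reads the cluster ID with Python's standard json module rather than a separately installed JSON parser. The file path and the jobFlowId key come from the excerpt; the function name read_cluster_id is hypothetical, and the file only exists on an EMR master node.

```python
import json
from pathlib import Path

# Path taken from the excerpt above; it exists only on an EMR master node.
JOB_FLOW_INFO = Path("/mnt/var/lib/info/job-flow.json")


def read_cluster_id(info_path=JOB_FLOW_INFO):
    """Return the jobFlowId (cluster ID) recorded in job-flow.json.

    read_cluster_id is a hypothetical helper name used for illustration.
    """
    with open(info_path, encoding="utf-8") as handle:
        info = json.load(handle)
    # "jobFlowId" is the key named in the excerpt; no other keys are assumed.
    return info["jobFlowId"]
```

Calling read_cluster_id() on the master node should return the cluster ID as a string; passing a different path makes the helper easy to try against a copy of the file.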

Aug 2015
Data Pipelines

Incremental Load: avoiding data loss

When copying data from RDS to Redshift, start the 'Incremental copy template' before the 'Full copy' to avoid data loss. A sample implementation can be: Incremental ...

Jun 2015
Demos

AML Demo

Using Amazon ML to Predict Responses to a Marketing Offer: With Amazon Machine Learning (Amazon ML), you can build and train predictive applications and host your applications in a scalable cloud solution. In this tuto...

Jun 2015
Architectures

FINANCIAL SERVICES GRID COMPUTING

Financial services grid computing on the cloud provides dynamic scalability and elasticity for operation when compute jobs are required, and utilizing services for aggregation that simplify the development of grid sof...

emr, DynamoDB, financial
Jun 2015
Architectures

Online Game

A cost-effective online game architecture featuring automatic capacity adjustment, a highly available and high-speed database, and a data processing cluster for player behavior analysis.

emr, DynamoDB, Gaming, online
Jun 2015
Architectures

TIME SERIES PROCESSING

When data arrives as a succession of regular measurements, it is known as time series information. Processing of time series information poses systems scaling challenges that the elasticity of AWS services is uniquely...

emr, time, DynamoDB, pipelines
Jun 2015
Architectures

WEB LOG ANALYSIS

How to build a scalable and reliable large-scale log analytics platform with EMR

emr, rds, log, analysis
Apr 2015
EMR || Elastic Map Reduce

EMR best practices

emr, best practices, pdf, aws