From EMR Escalations to AI Platform Reliability
A practical bridge from big data escalation work to AI platform reliability: the same habits show up in scheduling, observability, data quality, and recovery.
Orchestration helps only after the system can explain itself. Metrics, traces, logs, and workload shape need to come before automation confidence.
Data pipelines need reliability language too. Freshness, completeness, latency, correctness, and recoverability are better signals than green checkmarks.
The expensive part of an AI platform is not only the accelerator. It is the end-to-end path that keeps training and inference workloads fed, observable, and recoverable.
Some Spark incidents do not begin with a clean error. They begin as shape changes: skew, executor churn, storage pressure, and queues that stop draining.
A senior escalation playbook should reduce uncertainty: who is affected, which layer is failing, what changed, and what the safest next move is.
AI infrastructure inherits old distributed systems tradeoffs: partitioning, coordination, consistency, backpressure, and the operational cost of being almost right.
Streaming incidents usually have a shape: producer pressure, partition imbalance, consumer lag, checkpoint pain, or downstream systems that cannot keep up.
Serverless data tools remove servers from the interface, not from the system. Layout, metadata, file size, and query shape still decide the bill and the user experience.
Stateful systems turn small assumptions into operational puzzles. The fastest path is to separate scheduler symptoms from storage symptoms early.
Before big data escalations, there were routers: throughput limits, capacity enforcement, UNIX tooling, and the habit of respecting the data plane.
The Logstash plugin for Amazon DynamoDB gives you a nearly real-time view of the data in your DynamoDB table. The Logstash plugin for DynamoDB uses DynamoDB Streams to parse and output data as it is added to a DynamoD...
How does Data Pipeline install Task Runner on an EC2 instance? Data Pipeline launches an EC2 instance on your behalf with the following user-data script. ------------------------------------------------- #!/bin/bash se...
The public AWS documentation provides some Python examples for signing HTTP requests with IAM user credentials to access other AWS resources. In this case, an AWS ES cluster whose access policies are restricted to those IAM...
data.police.uk provides a complete snapshot of crime, outcome, and stop and search data, as held by the Home Office at a particular point in history. The actual data is located on S3 under bucket policeuk-data and can...
Why log aggregation? User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, ra...
The DynamoDB Session Handler is a custom session handler for PHP that allows developers to use Amazon DynamoDB as a session store. Using DynamoDB for session storage alleviates issues that occur with session handling ...
I observed that exporting large HBase tables with the HBase-provided 'Export' utility is heavily CPU-bound. If you are using default cluster configurations, the mappers may consume 100% CPU and may crash the regionServe...
Export/import activity is often limited by several performance bottlenecks, so it may be faster to use a distributed transfer instead of a normal transfer. Some of the bottlenecks include Re...
You can look in /mnt/var/lib/info/ on the master node to find a lot of information about your EMR cluster setup. More specifically, /mnt/var/lib/info/job-flow.json contains the jobFlowId or ClusterID. I was able to install a JSON p...
When copying data from RDS to Redshift, start the 'Incremental copy template' before the 'Full copy' to avoid data loss. A sample implementation: ------------------------------------------------- Incremental ...
Export the Hive metastore to S3 on a schedule.
aws s3 ls s3://bucket/folder --recursive | awk 'BEGIN {total=0}{total+=$3}END{print total/1024/1024" MB"}'
Using Amazon ML to Predict Responses to a Marketing Offer: With Amazon Machine Learning (Amazon ML), you can build and train predictive applications and host your applications in a scalable cloud solution. In this tuto...
Financial services grid computing on the cloud provides dynamic scalability and elasticity for operation when compute jobs are required, and utilizing services for aggregation that simplify the development of grid sof...
A cost-effective online game architecture featuring automatic capacity adjustment, a highly available and high-speed database, and a data processing cluster for player behavior analysis.
When data arrives as a succession of regular measurements, it is known as time series information. Processing of time series information poses systems scaling challenges that the elasticity of AWS services is uniquely...
How to build a scalable and reliable large-scale log analytics platform with EMR
Amazon Machine Learning Intro
Avoiding failover of the master node in a Redis cluster.
DynamoDB Intro
ElastiCache Intro
Kinesis Intro
EMR best practices
Shell activity to Create/Insert MySQL table with Data Pipelines