Cloud • Data • AI Infrastructure

Systems thinking for infrastructure built at scale.

Architecture notes on distributed data platforms, cloud reliability, Spark and EMR, ML infrastructure, observability, and the production tradeoffs behind modern AI systems.

Data Platforms

Practical lessons from running Spark, Hadoop, EMR, Hive, HBase, Glue, Athena, Kinesis, and other production-scale data systems.

AI Infrastructure

Notes on the systems beneath AI: storage, compute, orchestration, pipelines, inference, observability, and reliability.

Reliability at Scale

How complex systems fail, how to debug them, and how to design architectures that survive real production pressure.

Architecture Notes

Deep technical writing for engineers, architects, and teams building cloud, data, and AI systems in production.

Coming soon
Distributed Systems

What Breaks First in Distributed Data Systems

A practical look at the failure modes that show up when Spark, storage, schedulers, and humans meet production scale.

Coming soon
Cloud Architecture

EMR, Glue, Athena, or Spark on Kubernetes?

How I think about choosing the right AWS data architecture based on workload shape, operations, cost, and team maturity.

Coming soon
AI Infrastructure

The Systems Beneath Modern AI

LLMs look magical from the top, but underneath they run on queues, schedulers, storage, GPUs, caches, metrics, and reliability tradeoffs.

About

Built at Scale is a technical site by Raja Mannem focused on cloud, data, distributed systems, and AI infrastructure.

Topics I write about

Production architecture, distributed systems debugging, Spark and EMR, data platform reliability, AI/ML infrastructure, observability, cost-performance tradeoffs, and engineering lessons from operating complex systems.

AWS • EMR • Spark • Hadoop • SageMaker • Glue • Kinesis • DynamoDB • Observability • AI Infra