AWS Monitoring & Observability: CloudWatch, CloudTrail, X-Ray and More

Once an app is running on AWS, the next question is: how do you know if it is actually okay? AWS gives you a handful of services to watch performance, audit who did what, trace requests across systems, and check the health of AWS itself.

The mental map

┌─────────────────────────────────────────────────────────┐
│  WATCH performance & react                              │
│  • CloudWatch Metrics  (numbers: CPU, network, billing) │
│  • CloudWatch Alarms   (trigger on a metric)            │
│  • CloudWatch Logs     (collect log files)              │
│  • EventBridge         (react to events / schedule)     │
├─────────────────────────────────────────────────────────┤
│  AUDIT who did what                                     │
│  • CloudTrail          (API call history / audit)       │
├─────────────────────────────────────────────────────────┤
│  TRACE & analyze app behavior                           │
│  • X-Ray               (trace requests across services) │
│  • CodeGuru            (ML code review + profiling)     │
├─────────────────────────────────────────────────────────┤
│  CHECK service health                                   │
│  • Health Dashboard (Service)  (all AWS, all regions)   │
│  • Health Dashboard (Account)  (events impacting YOU)   │
└─────────────────────────────────────────────────────────┘

CloudWatch — the core monitoring service

Metrics

CloudWatch Metrics are numbers tracked over time, such as CPUUtilization and NetworkIn. Most AWS services publish metrics automatically, every datapoint is timestamped, and you can build dashboards from them.

Important metrics to know:

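EC2: CPU, status checks, and network. RAM is not among the default metrics, so memory usage has to be published separately (via the CloudWatch agent or a custom metric). The default frequency is one datapoint every 5 minutes; Detailed Monitoring (paid) raises that to one every minute.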
EBS: disk reads and writes.
S3: BucketSizeBytes, NumberOfObjects, AllRequests.
Billing: total estimated charges (the EstimatedCharges metric), available only in us-east-1.
Custom metrics: push your own.
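
To make "push your own" concrete, here is a minimal sketch that builds a request shaped like the CloudWatch PutMetricData call. The namespace MyApp/Host, the metric name MemoryUsedPercent, the instance ID, and the helper build_custom_metric are all hypothetical; the sketch only builds the payload and does not call AWS.

```python
from datetime import datetime, timezone


def build_custom_metric(namespace, name, value, unit="Count", dimensions=None):
    """Return keyword arguments shaped like a CloudWatch PutMetricData request."""
    datum = {
        "MetricName": name,
        "Timestamp": datetime.now(timezone.utc),  # every datapoint is timestamped
        "Value": float(value),
        "Unit": unit,
        # Dimensions are name/value pairs that identify where the metric came from.
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
    }
    return {"Namespace": namespace, "MetricData": [datum]}


# Hypothetical example: report memory use, which EC2 does not publish by default.
request = build_custom_metric(
    "MyApp/Host",
    "MemoryUsedPercent",
    63.5,
    unit="Percent",
    dimensions={"InstanceId": "i-0123456789abcdef0"},
)
print(request["Namespace"], request["MetricData"][0]["MetricName"])
```

With boto3 installed and credentials configured, a dictionary like this would typically be sent with boto3.client("cloudwatch").put_metric_data(**request); treat that as an assumption about your setup rather than part of the sketch.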

Alarms

CloudWatch Alarms trigger actions when a metric crosses a threshold.

Actions: Auto Scaling (change the desired count), EC2 Actions (stop, terminate, reboot, recover), or SNS notification.
States: OK, INSUFFICIENT_DATA, ALARM.
Classic example: a billing alarm that notifies you when spend exceeds $X.
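
The three states can be illustrated with a small simulation. This is a deliberately simplified model, not CloudWatch's actual evaluation algorithm (which also has configurable missing-data handling): the function evaluate_alarm and its rules are assumptions made for illustration, using a hypothetical $100 billing threshold.

```python
def evaluate_alarm(datapoints, threshold, evaluation_periods=1):
    """Simplified alarm evaluation (an illustration, not CloudWatch's real logic).

    datapoints: metric values, oldest first; None marks a period with no data.
    Returns "ALARM" only if each of the last `evaluation_periods` periods has a
    value above the threshold, "INSUFFICIENT_DATA" if there is nothing to judge.
    """
    recent = datapoints[-evaluation_periods:]
    present = [value for value in recent if value is not None]
    if len(recent) < evaluation_periods or not present:
        return "INSUFFICIENT_DATA"
    if len(present) == evaluation_periods and all(v > threshold for v in present):
        return "ALARM"
    return "OK"


# Hypothetical estimated charges in dollars, checked against a $100 threshold.
print(evaluate_alarm([], 100))                   # no data yet
print(evaluate_alarm([40.0, 75.0], 100))         # latest value is under budget
print(evaluate_alarm([40.0, 75.0, 120.0], 100))  # latest value is over budget
```

Requiring more than one evaluation period makes the alarm less jumpy: a single spike does not trigger it unless the following periods are also over the threshold.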

Logs

CloudWatch Logs collects logs from Elastic Beanstalk, ECS, Lambda, CloudTrail, Route 53 (DNS queries), and from EC2 instances and on-prem servers via the CloudWatch agent.

EC2 needs the CloudWatch agent installed to push logs — by default, no EC2 logs flow to CloudWatch. It also needs the right IAM permissions.
The agent works on-prem too.
This enables real-time log monitoring.
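
As a rough illustration of what the agent needs to be told, the sketch below generates a minimal "logs" section in the shape of the CloudWatch agent's JSON configuration. The file path /var/log/myapp/app.log, the log group name myapp-logs, and the helper agent_logs_config are hypothetical; check the agent documentation for the full set of options before relying on this.

```python
import json


def agent_logs_config(file_path, log_group, log_stream="{instance_id}"):
    """Build a minimal CloudWatch agent config that ships one log file."""
    return {
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": file_path,         # file on the instance
                            "log_group_name": log_group,    # destination log group
                            "log_stream_name": log_stream,  # one stream per instance
                        }
                    ]
                }
            }
        }
    }


# Hypothetical application log and log group names.
config = agent_logs_config("/var/log/myapp/app.log", "myapp-logs")
print(json.dumps(config, indent=2))
```

The configuration alone is not enough: as noted above, the instance also needs IAM permissions that allow it to write to CloudWatch Logs.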

EventBridge (formerly CloudWatch Events)

EventBridge lets you react to events or run things on a schedule.

Source (S3 upload, EC2 state, schedule, CloudTrail API call)
   │
   ▼
EventBridge (rule matches)
   │
   ▼
Target (Lambda, SQS, SNS, Step Functions, ECS, etc.)

Two trigger types:

Schedule (cron) → “every hour, run this Lambda.”
Event pattern → “when the root user signs in, send an SNS alert.”
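
Both trigger types can be sketched in a few lines. The schedule expression and the root sign-in pattern below follow EventBridge's documented shapes as far as I know them, but the matches function is a simplified stand-in for rule matching (exact values only, no prefix or numeric operators), and the sample events are hypothetical.

```python
# Schedule trigger: "every hour" written as a rate expression.
HOURLY_SCHEDULE = "rate(1 hour)"

# Event pattern trigger: console sign-in events where the identity is the root user.
ROOT_SIGN_IN_PATTERN = {
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}


def matches(pattern, event):
    """Simplified pattern matching: every pattern key must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested part of the event.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # A list in the pattern means "any one of these values".
            return False
    return True


def sign_in_event(identity_type):
    """Build a hypothetical, heavily trimmed sign-in event."""
    return {
        "source": "aws.signin",
        "detail-type": "AWS Console Sign In via CloudTrail",
        "detail": {
            "eventName": "ConsoleLogin",
            "userIdentity": {"type": identity_type},
        },
    }


print(matches(ROOT_SIGN_IN_PATTERN, sign_in_event("Root")))     # root sign-in
print(matches(ROOT_SIGN_IN_PATTERN, sign_in_event("IAMUser")))  # ordinary user
```

In a real rule, the pattern would be attached to an event bus and paired with a target such as an SNS topic; the matching itself is done by EventBridge, not by your code.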

Event buses:

Default — AWS service events.
Partner — SaaS apps like Zendesk and Datadog.
Custom — your own apps.

It also has a schema registry and supports archiving and replaying events.

Mental model: EventBridge is the “if this AWS thing happens, then do that” router — the backbone of event-driven automation.

CloudTrail — audit log

CloudTrail records the API calls and actions made in your AWS account: who did what, when, and from where.

Console / SDK / CLI / IAM users & roles
   │
   ▼
CloudTrail (records the action)
   │
   ├──► CloudWatch Logs
   └──► S3 bucket

Key facts:

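Enabled by default: recent events are viewable in the console, and you create a trail to deliver them to S3 or CloudWatch Logs for longer retention.
Used for governance, compliance, and audit.
A trail applies to all regions by default, or to a single region if you choose.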
The rule of thumb: if a resource is deleted in AWS, check CloudTrail first — it tells you who deleted it and when.
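
The "who deleted it" lookup can be sketched as a filter over CloudTrail-style records. The field names (eventTime, eventName, userIdentity, sourceIPAddress, requestParameters) follow the CloudTrail record format, but the sample records, the account ID, and the helper who_deleted are hypothetical, and matching on a name prefix is a simplification.

```python
# Event names that usually indicate a resource was removed (a simplification).
DESTRUCTIVE_PREFIXES = ("Delete", "Terminate")


def who_deleted(records, resource_id):
    """Return who/when/where for destructive calls that mention resource_id."""
    hits = []
    for record in records:
        if not record["eventName"].startswith(DESTRUCTIVE_PREFIXES):
            continue
        params = record.get("requestParameters") or {}
        if resource_id not in str(params):
            continue
        hits.append({
            "when": record["eventTime"],
            "who": record["userIdentity"]["arn"],
            "from": record["sourceIPAddress"],
            "action": record["eventName"],
        })
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(hits, key=lambda hit: hit["when"])


# Hypothetical, heavily trimmed records for illustration only.
SAMPLE_RECORDS = [
    {
        "eventTime": "2024-05-01T09:12:44Z",
        "eventName": "PutObject",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
        "sourceIPAddress": "203.0.113.10",
        "requestParameters": {"bucketName": "reports-bucket"},
    },
    {
        "eventTime": "2024-05-01T10:03:17Z",
        "eventName": "DeleteBucket",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
        "sourceIPAddress": "198.51.100.7",
        "requestParameters": {"bucketName": "reports-bucket"},
    },
]

for hit in who_deleted(SAMPLE_RECORDS, "reports-bucket"):
    print(hit["when"], hit["who"], hit["action"])
```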

CloudWatch vs CloudTrail (commonly confused):

CloudWatch = performance, metrics, logs — how is it doing?
CloudTrail = audit, API history — who did what?

X-Ray — distributed tracing

X-Ray traces a request as it flows through your services.

The problem it solves: in a microservice or distributed app, a single request hits EC2 → DynamoDB → SNS → and so on. When it is slow, where is the bottleneck? Logs alone can’t tell you. X-Ray draws a visual service map with latency at each hop.

Client → EC2 (70ms) → DynamoDB (30ms)
                    → SNS (43ms)

Advantages: troubleshoot bottlenecks, understand dependencies, pinpoint service issues, find errors and exceptions, check SLAs, see where you are being throttled, and identify impacted users.

Use it whenever you have a distributed or microservice app and can’t tell which service is slow.
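
The question a service map answers can be reduced to a small calculation. The sketch below ranks the hops from the example above by latency; it assumes each number is the time spent in that hop alone, and the helper rank_hops is hypothetical and unrelated to the X-Ray SDK.

```python
def rank_hops(hops):
    """Sort (service, latency_ms) pairs from slowest to fastest."""
    return sorted(hops, key=lambda hop: hop[1], reverse=True)


# Latencies taken from the example trace above, in milliseconds.
trace = [("EC2", 70), ("DynamoDB", 30), ("SNS", 43)]

ranked = rank_hops(trace)
slowest_service, slowest_ms = ranked[0]
total_ms = sum(latency for _, latency in trace)

for service, latency in ranked:
    share = 100 * latency / total_ms
    print(f"{service}: {latency} ms ({share:.0f}% of traced time)")
print("Look first at:", slowest_service)
```

X-Ray does this aggregation for you across many requests and draws it as a graph; the value of the tool is that you do not have to stitch the timings together from separate logs.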

CodeGuru — ML-powered code analysis

CodeGuru has two parts, covering two phases of the dev lifecycle:

Coding ──► Build & Test ──► Deploy ──► Measure
   │                                       │
CodeGuru Reviewer                    CodeGuru Profiler
(static analysis,                    (runtime performance,
 pre-prod)                            production)

CodeGuru Reviewer — automated code reviews via static analysis:

Finds bugs, security vulnerabilities, and resource leaks.
ML model trained on millions of code reviews.
Supports Java and Python.
Integrates with GitHub, Bitbucket, and CodeCommit.

CodeGuru Profiler — runtime performance in production:

Finds code inefficiencies and excessive CPU use.
Reduces CPU and compute costs.
Provides heap summaries and anomaly detection.
Works on AWS or on-prem.

Mental model: Reviewer checks your code before prod; Profiler watches your code in prod.

Health Dashboards — is AWS itself okay?

There are two health dashboards, and they are often confused:

Service Health Dashboard (now the service health view of the AWS Health Dashboard):

The status of all AWS services across all regions — the general AWS status.
Historical info per day, plus an RSS feed.
Answers: “is AWS having an outage somewhere?”

Account Health Dashboard (formerly the Personal Health Dashboard, PHD):

Events that specifically impact YOUR resources.
Personalized alerts and remediation guidance.
Can aggregate across an entire AWS Organization.
Answers: “is anything AWS is doing going to affect me?”

The distinction: Service = global AWS status. Account = personalized to your account.

How they fit together

                  Your Application
                        │
   ┌────────────────────┼────────────────────┐
   │                    │                     │
   ▼                    ▼                     ▼
CloudWatch         CloudTrail              X-Ray
(metrics, logs,    (who did what,          (trace requests
 alarms)            audit trail)            across services)
   │
   ▼
EventBridge ──► automate reactions (Lambda, SNS, etc.)

Meanwhile:

Health Dashboards ──► is AWS itself healthy? (service-wide / your account)
CodeGuru ──► is your code good & efficient?

Decision shortcuts

Need                                               Service
Monitor CPU/network/performance                    CloudWatch Metrics
Get notified when a metric crosses a threshold     CloudWatch Alarms
Collect and search application logs                CloudWatch Logs
React to an event or run on a schedule             EventBridge
Find out who deleted/changed a resource            CloudTrail
Debug slowness across microservices                X-Ray
Automated code review / find bugs                  CodeGuru Reviewer
Find runtime performance issues in prod            CodeGuru Profiler
Is AWS having an outage?                           Service Health Dashboard
Is AWS doing something that affects my resources?  Account Health Dashboard

The big mental split

PERFORMANCE  →  CloudWatch (metrics, logs, alarms)
AUTOMATION   →  EventBridge (react to events)
AUDIT        →  CloudTrail (who did what)
TRACING      →  X-Ray (where's the bottleneck)
CODE QUALITY →  CodeGuru (review + profile)
AWS STATUS   →  Health Dashboards (service-wide / your account)

Summary

CloudWatch tells you how your app is performing — metrics, alarms, and logs.
EventBridge reacts automatically — to events or on a schedule.
CloudTrail tells you who did what — the audit trail.
X-Ray tells you where requests slow down in a distributed system.
CodeGuru reviews your code (Reviewer) and profiles it in production (Profiler).
Health Dashboards tell you whether AWS itself is healthy — globally, or for your account specifically.
Pick by the question you are asking: performance, audit, tracing, automation, code quality, or AWS status.
