aws-monitoring
About
This AWS monitoring skill helps developers debug production issues and deployment failures in AWS environments. It enables checking Lambda logs, monitoring CloudWatch metrics, and troubleshooting resource problems using tools like the SST console. Use it when investigating errors, analyzing API performance, or setting up monitoring alarms.
Documentation
AWS Monitoring Skill
This skill helps you monitor and debug AWS resources for the SG Cars Trends platform.
When to Use This Skill
- Investigating production errors
- Checking Lambda function logs
- Monitoring API performance
- Debugging deployment failures
- Analyzing CloudWatch metrics
- Setting up alarms
- Troubleshooting resource issues
Monitoring Tools
SST Console
SST provides a built-in console for monitoring:
# Open SST console for specific stage
npx sst console --stage production
npx sst console --stage staging
npx sst console --stage dev
Features:
- Real-time Lambda logs
- Function invocations
- Error tracking
- Resource overview
- Environment variables
CloudWatch Logs
Access Lambda logs via CloudWatch:
# View logs using SST
npx sst logs --stage production
# View specific function logs
npx sst logs --stage production --function api
# Tail logs in real-time
npx sst logs --stage production --function api --tail
# Filter logs
npx sst logs --stage production --function api --filter "ERROR"
# Show logs from specific time
npx sst logs --stage production --function api --since 1h
npx sst logs --stage production --function api --since "2024-01-15 10:00"
AWS CLI
Use AWS CLI for advanced log queries:
# List log groups
aws logs describe-log-groups \
--log-group-name-prefix "/aws/lambda/sgcarstrends"
# Get recent log streams
aws logs describe-log-streams \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--order-by LastEventTime \
--descending \
--max-items 5
# Tail logs
aws logs tail "/aws/lambda/sgcarstrends-api-production" --follow
# Filter logs
aws logs filter-log-events \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000
# Get logs for specific request
aws logs filter-log-events \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--filter-pattern "request-id-here"
CloudWatch Metrics
Lambda Metrics
# Get Lambda invocations
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get errors
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get duration
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum
API Gateway Metrics
# Get API requests
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Count \
--dimensions Name=ApiName,Value=sgcarstrends-api \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get 4XX errors
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 4XXError \
--dimensions Name=ApiName,Value=sgcarstrends-api \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get latency
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Latency \
--dimensions Name=ApiName,Value=sgcarstrends-api \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum \
--extended-statistics p99
CloudWatch Alarms
Creating Alarms
// infra/alarms.ts
import { StackContext, use } from "sst/constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as cloudwatchActions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
import * as subscriptions from "aws-cdk-lib/aws-sns-subscriptions";
import { API } from "./api";

export function Alarms({ stack, app }: StackContext) {
  const { api } = use(API);

  // Only create alarms for production
  if (app.stage !== "production") {
    return;
  }

  // SNS topic for alarms
  const alarmTopic = new sns.Topic(stack, "AlarmTopic");

  // Add email subscription
  alarmTopic.addSubscription(
    new subscriptions.EmailSubscription("[email protected]")
  );

  // High error rate alarm (SnsAction lives in aws-cloudwatch-actions, not aws-cloudwatch)
  new cloudwatch.Alarm(stack, "ApiHighErrorRate", {
    metric: api.metricErrors(),
    threshold: 10,
    evaluationPeriods: 2,
    datapointsToAlarm: 2,
    alarmDescription: "API has high error rate",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));

  // High duration alarm
  new cloudwatch.Alarm(stack, "ApiHighDuration", {
    metric: api.metricDuration(),
    threshold: 5000, // 5 seconds
    evaluationPeriods: 2,
    datapointsToAlarm: 2,
    alarmDescription: "API response time is high",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));

  // Throttle alarm
  new cloudwatch.Alarm(stack, "ApiThrottled", {
    metric: api.metricThrottles(),
    threshold: 1,
    evaluationPeriods: 1,
    alarmDescription: "API is being throttled",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));
}
Add to SST config:
// infra/sst.config.ts
import { SSTConfig } from "sst";
// Import paths assume the stacks live alongside this file; adjust to your layout
import { DNS } from "./dns";
import { API } from "./api";
import { Web } from "./web";
import { Alarms } from "./alarms";

export default {
  stacks(app) {
    app
      .stack(DNS)
      .stack(API)
      .stack(Web)
      .stack(Alarms); // Add alarms stack
  },
} satisfies SSTConfig;
Managing Alarms via CLI
# List alarms
aws cloudwatch describe-alarms
# Get alarm state
aws cloudwatch describe-alarms \
--alarm-names "sgcarstrends-ApiHighErrorRate"
# Disable alarm
aws cloudwatch disable-alarm-actions \
--alarm-names "sgcarstrends-ApiHighErrorRate"
# Enable alarm
aws cloudwatch enable-alarm-actions \
--alarm-names "sgcarstrends-ApiHighErrorRate"
# Delete alarm
aws cloudwatch delete-alarms \
--alarm-names "sgcarstrends-ApiHighErrorRate"
CloudWatch Insights
Querying Logs
# Start query
aws logs start-query \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
# Get query results
aws logs get-query-results --query-id <query-id>
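start-query is asynchronous: it returns a query ID immediately, and results are available only once the query status is Complete. A minimal polling sketch, reusing the log group and query from above:
# Start the query and capture its ID
QUERY_ID=$(aws logs start-query \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20' \
--query queryId --output text)
# Poll until CloudWatch reports the query complete, then fetch the results
until [ "$(aws logs get-query-results --query-id "$QUERY_ID" --query status --output text)" = "Complete" ]; do
  sleep 2
done
aws logs get-query-results --query-id "$QUERY_ID"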
Common Queries
Find errors:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
API performance:
fields @timestamp, @duration
| stats avg(@duration), max(@duration), min(@duration)
Count errors by type:
fields @message
| filter @message like /ERROR/
| parse @message /(?<errorType>\w+Error)/
| stats count() by errorType
Slow requests:
fields @timestamp, @duration, @requestId
| filter @duration > 1000
| sort @duration desc
| limit 20
Request rate:
fields @timestamp
| stats count() by bin(5m)
X-Ray Tracing
Enable X-Ray
// infra/api.ts
import { StackContext, Function } from "sst/constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";

export function API({ stack }: StackContext) {
  const api = new Function(stack, "api", {
    handler: "apps/api/src/index.handler",
    tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
  });

  return { api };
}
Instrument Code
// apps/api/src/index.ts
import { captureAWSv3Client } from "aws-xray-sdk-core";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));
View Traces
# Get service graph
aws xray get-service-graph \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s)
# Get trace summaries
aws xray get-trace-summaries \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s)
# Get trace details
aws xray batch-get-traces --trace-ids <trace-id>
Resource Monitoring
Lambda Functions
# List functions
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `sgcarstrends`)].FunctionName'
# Get function config
aws lambda get-function-configuration \
--function-name sgcarstrends-api-production
# Get function code location
aws lambda get-function \
--function-name sgcarstrends-api-production
# Invoke function
aws lambda invoke \
--function-name sgcarstrends-api-production \
--cli-binary-format raw-in-base64-out \
--payload '{"path": "/health"}' \
response.json
cat response.json
CloudFront Distributions
# List distributions
aws cloudfront list-distributions \
--query 'DistributionList.Items[*].[Id,DomainName,Status]' \
--output table
# Get distribution config
aws cloudfront get-distribution-config --id <distribution-id>
# Create invalidation (cache clear)
aws cloudfront create-invalidation \
--distribution-id <distribution-id> \
--paths "/*"
# List invalidations
aws cloudfront list-invalidations --distribution-id <distribution-id>
S3 Buckets
# List buckets
aws s3 ls
# Get bucket size
aws s3 ls s3://bucket-name --recursive --summarize | grep "Total Size"
# Monitor bucket metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/S3 \
--metric-name BucketSizeBytes \
--dimensions Name=BucketName,Value=bucket-name Name=StorageType,Value=StandardStorage \
--start-time $(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 86400 \
--statistics Average
Cost Monitoring
Cost Explorer
# Get cost and usage
aws ce get-cost-and-usage \
--time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=SERVICE
# Get cost by tag
aws ce get-cost-and-usage \
--time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=TAG,Key=Environment
Budget Alerts
Create a budget in the AWS Console or via the CLI; example budget.json and notifications.json files follow the command:
# Create budget
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget file://budget.json \
--notifications-with-subscribers file://notifications.json
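The command above expects two JSON files. A minimal sketch of both, assuming a USD 50 monthly cost budget with an email alert at 80% of actual spend; adjust the budget name, amount, and subscriber address to your account:
# budget.json - a monthly cost budget (values here are illustrative)
cat > budget.json <<'EOF'
{
  "BudgetName": "sgcarstrends-monthly",
  "BudgetLimit": { "Amount": "50", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
# notifications.json - email when actual spend crosses 80% of the budget
cat > notifications.json <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "[email protected]" }
    ]
  }
]
EOF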
Debugging Production Issues
1. Check Recent Deployments
# Get stack events
aws cloudformation describe-stack-events \
--stack-name sgcarstrends-api-production \
--max-items 50
# Get deployment status
npx sst stacks info API --stage production
2. Check Logs for Errors
# Get recent errors
npx sst logs --stage production --function api --filter "ERROR" --since 1h
# Or use AWS CLI
aws logs tail "/aws/lambda/sgcarstrends-api-production" \
--follow \
--filter-pattern "ERROR"
3. Check Metrics
# Check invocations and errors
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
4. Test Endpoint
# Test API directly
curl -I https://api.sgcarstrends.com/health
# Test with verbose output
curl -v https://api.sgcarstrends.com/health
5. Check Resource Limits
# Check Lambda quotas
aws service-quotas get-service-quota \
--service-code lambda \
--quota-code L-B99A9384 # Concurrent executions
# Check API Gateway quotas
aws service-quotas list-service-quotas \
--service-code apigateway
Common Issues
High Latency
Investigation:
- Check Lambda duration metrics (a percentile check follows below)
- Review CloudWatch Insights for slow queries
- Check database connection pool
- Review API response times
Solutions:
- Increase Lambda memory
- Optimize database queries
- Add caching
- Use connection pooling
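For the duration check above, averages hide tail latency, so query percentiles as well. Note that percentile statistics must go in --extended-statistics rather than --statistics:
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--extended-statistics p50 p95 p99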
High Error Rate
Investigation (an error-rate sketch follows below):
- Check error logs
- Review error types
- Check external service status
- Verify environment variables
Solutions:
- Fix application bugs
- Add error handling
- Retry failed requests
- Check API rate limits
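To judge severity, express errors as a fraction of invocations rather than an absolute count. A rough sketch, assuming the production function name used throughout; a window with no datapoints reports 0%:
#!/bin/bash
FN=sgcarstrends-api-production
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
END=$(date -u +%Y-%m-%dT%H:%M:%S)
# Sum a Lambda metric over the window; prints "None" when there are no datapoints
sum_metric() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$1" \
    --dimensions Name=FunctionName,Value=$FN \
    --start-time "$START" --end-time "$END" \
    --period 3600 --statistics Sum \
    --query 'Datapoints[0].Sum' --output text
}
ERRORS=$(sum_metric Errors); [ "$ERRORS" = "None" ] && ERRORS=0
INVOCATIONS=$(sum_metric Invocations); [ "$INVOCATIONS" = "None" ] && INVOCATIONS=0
awk -v e="$ERRORS" -v i="$INVOCATIONS" \
  'BEGIN { printf "Error rate: %.2f%%\n", (i > 0) ? 100 * e / i : 0 }'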
Cold Starts
Investigation:
- Check init duration (an Insights query follows below)
- Review package size
- Check provisioned concurrency
Solutions:
- Enable provisioned concurrency
- Reduce bundle size
- Use ARM architecture
- Optimize imports
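Lambda reports @initDuration on the REPORT log line of every cold start, so CloudWatch Logs Insights can quantify them directly. A sketch using the production log group from earlier examples (fetch results with get-query-results as shown above):
aws logs start-query \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--start-time $(date -u -d '1 day ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @type = "REPORT" and ispresent(@initDuration) | stats count() as coldStarts, avg(@initDuration) as avgInitMs, max(@initDuration) as maxInitMs'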
Monitoring Scripts
Health Check Script
#!/bin/bash
# scripts/health-check.sh
STAGE=${1:-production}

# Production uses the bare domains; other stages get a stage subdomain.
# Adjust if your domain scheme differs.
if [ "$STAGE" = "production" ]; then
  API_URL="https://api.sgcarstrends.com"
  WEB_URL="https://sgcarstrends.com"
else
  API_URL="https://api.$STAGE.sgcarstrends.com"
  WEB_URL="https://$STAGE.sgcarstrends.com"
fi

echo "Checking health of $STAGE environment..."

# Check API
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")
if [ "$API_STATUS" -eq 200 ]; then
  echo "✓ API is healthy"
else
  echo "✗ API is down (status: $API_STATUS)"
  exit 1
fi

# Check Web
WEB_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$WEB_URL")
if [ "$WEB_STATUS" -eq 200 ]; then
  echo "✓ Web is healthy"
else
  echo "✗ Web is down (status: $WEB_STATUS)"
  exit 1
fi

echo "All services are healthy!"
Run:
chmod +x scripts/health-check.sh
./scripts/health-check.sh production
Log Analysis Script
#!/bin/bash
# scripts/analyze-logs.sh
STAGE=${1:-production}
LOG_GROUP="/aws/lambda/sgcarstrends-api-$STAGE"
echo "Analyzing logs for $STAGE..."
# Count errors in last hour
ERROR_COUNT=$(aws logs filter-log-events \
--log-group-name "$LOG_GROUP" \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--query 'events[*].message' \
--output text | wc -l)
echo "Errors in last hour: $ERROR_COUNT"
# Get top errors
echo -e "\nTop error types:"
aws logs filter-log-events \
--log-group-name "$LOG_GROUP" \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--query 'events[*].message' \
--output text | \
grep -oE '\w+Error' | \
sort | uniq -c | sort -rn | head -5
References
- CloudWatch Documentation: https://docs.aws.amazon.com/cloudwatch
- Lambda Monitoring: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions.html
- X-Ray: https://docs.aws.amazon.com/xray
- Related files:
  - infra/ - Infrastructure with monitoring config
  - Root CLAUDE.md - Project documentation
Best Practices
- Log Levels: Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
- Structured Logging: Use JSON format for easier parsing
- Correlation IDs: Track requests across services
- Alarms: Set up alarms for critical metrics
- Dashboards: Create CloudWatch dashboards for key metrics
- Cost Monitoring: Track AWS costs regularly
- Regular Reviews: Review logs and metrics weekly
- Retention: Set appropriate log retention (7-30 days); see the commands after this list
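Two of these practices are single commands. A sketch covering the retention and dashboard items; the log group matches earlier examples, while the dashboard name and dashboard.json file are placeholders:
# Set 14-day retention on the production API log group
aws logs put-retention-policy \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--retention-in-days 14
# Create or update a dashboard from a JSON definition (dashboard.json not shown)
aws cloudwatch put-dashboard \
--dashboard-name sgcarstrends-production \
--dashboard-body file://dashboard.json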
Quick Install
/plugin add https://github.com/sgcarstrends/sgcarstrends/tree/main/aws-monitoring

Copy and paste this command in Claude Code to install this skill.
