aws-monitoring
About
This AWS monitoring skill helps developers debug production issues and deployment failures in AWS environments. It enables checking Lambda logs, monitoring CloudWatch metrics, and troubleshooting resource problems using tools like the SST console. Use it when investigating errors, analyzing API performance, or setting up monitoring alarms.
Documentation
AWS Monitoring Skill
This skill helps you monitor and debug AWS resources for the SG Cars Trends platform.
When to Use This Skill
- Investigating production errors
- Checking Lambda function logs
- Monitoring API performance
- Debugging deployment failures
- Analyzing CloudWatch metrics
- Setting up alarms
- Troubleshooting resource issues
Monitoring Tools
SST Console
SST provides a built-in console for monitoring:
# Open SST console for specific stage
npx sst console --stage production
npx sst console --stage staging
npx sst console --stage dev
Features:
- Real-time Lambda logs
- Function invocations
- Error tracking
- Resource overview
- Environment variables
CloudWatch Logs
Access Lambda logs via CloudWatch:
# View logs using SST
npx sst logs --stage production
# View specific function logs
npx sst logs --stage production --function api
# Tail logs in real-time
npx sst logs --stage production --function api --tail
# Filter logs
npx sst logs --stage production --function api --filter "ERROR"
# Show logs from specific time
npx sst logs --stage production --function api --since 1h
npx sst logs --stage production --function api --since "2024-01-15 10:00"
AWS CLI
Use AWS CLI for advanced log queries:
# List log groups
aws logs describe-log-groups \
--log-group-name-prefix "/aws/lambda/sgcarstrends"
# Get recent log streams
aws logs describe-log-streams \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--order-by LastEventTime \
--descending \
--max-items 5
# Tail logs
aws logs tail "/aws/lambda/sgcarstrends-api-production" --follow
# Filter logs
aws logs filter-log-events \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000
# Get logs for specific request
aws logs filter-log-events \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--filter-pattern "request-id-here"
CloudWatch Metrics
Lambda Metrics
# Get Lambda invocations
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get errors
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get duration
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum
API Gateway Metrics
# Get API requests
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Count \
--dimensions Name=ApiName,Value=sgcarstrends-api \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get 4XX errors
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name 4XXError \
--dimensions Name=ApiName,Value=sgcarstrends-api \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
# Get latency
aws cloudwatch get-metric-statistics \
--namespace AWS/ApiGateway \
--metric-name Latency \
--dimensions Name=ApiName,Value=sgcarstrends-api \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum \
--extended-statistics p99
CloudWatch Alarms
Creating Alarms
// infra/alarms.ts
import { StackContext, use } from "sst/constructs";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";
import * as cloudwatchActions from "aws-cdk-lib/aws-cloudwatch-actions";
import * as sns from "aws-cdk-lib/aws-sns";
import * as subscriptions from "aws-cdk-lib/aws-sns-subscriptions";
import { API } from "./api";

export function Alarms({ stack, app }: StackContext) {
  const { api } = use(API);

  // Only create alarms for production
  if (app.stage !== "production") {
    return;
  }

  // SNS topic for alarms
  const alarmTopic = new sns.Topic(stack, "AlarmTopic");

  // Add email subscription
  alarmTopic.addSubscription(
    new subscriptions.EmailSubscription("[email protected]")
  );

  // High error rate alarm (SnsAction lives in aws-cloudwatch-actions, not aws-cloudwatch)
  new cloudwatch.Alarm(stack, "ApiHighErrorRate", {
    metric: api.metricErrors(),
    threshold: 10,
    evaluationPeriods: 2,
    datapointsToAlarm: 2,
    alarmDescription: "API has high error rate",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));

  // High duration alarm
  new cloudwatch.Alarm(stack, "ApiHighDuration", {
    metric: api.metricDuration(),
    threshold: 5000, // 5 seconds
    evaluationPeriods: 2,
    datapointsToAlarm: 2,
    alarmDescription: "API response time is high",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));

  // Throttle alarm
  new cloudwatch.Alarm(stack, "ApiThrottled", {
    metric: api.metricThrottles(),
    threshold: 1,
    evaluationPeriods: 1,
    alarmDescription: "API is being throttled",
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  }).addAlarmAction(new cloudwatchActions.SnsAction(alarmTopic));
}
Add to SST config:
// infra/sst.config.ts
import { SSTConfig } from "sst";
// Import paths assume the stacks live alongside this file; adjust to your layout
import { DNS } from "./dns";
import { API } from "./api";
import { Web } from "./web";
import { Alarms } from "./alarms";

export default {
  stacks(app) {
    app
      .stack(DNS)
      .stack(API)
      .stack(Web)
      .stack(Alarms); // Add alarms stack
  },
} satisfies SSTConfig;
Managing Alarms via CLI
# List alarms
aws cloudwatch describe-alarms
# Get alarm state
aws cloudwatch describe-alarms \
--alarm-names "sgcarstrends-ApiHighErrorRate"
# Disable alarm
aws cloudwatch disable-alarm-actions \
--alarm-names "sgcarstrends-ApiHighErrorRate"
# Enable alarm
aws cloudwatch enable-alarm-actions \
--alarm-names "sgcarstrends-ApiHighErrorRate"
# Delete alarm
aws cloudwatch delete-alarms \
--alarm-names "sgcarstrends-ApiHighErrorRate"
CloudWatch Insights
Querying Logs
# Start query
aws logs start-query \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20'
# Get query results
aws logs get-query-results --query-id <query-id>
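start-query is asynchronous: it returns a query ID immediately, and results are available only once the query status is Complete. A minimal polling sketch, reusing the log group and query from above:
# Start the query and capture its ID
QUERY_ID=$(aws logs start-query \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20' \
--query queryId --output text)
# Poll until CloudWatch reports the query complete, then fetch the results
until [ "$(aws logs get-query-results --query-id "$QUERY_ID" --query status --output text)" = "Complete" ]; do
  sleep 2
done
aws logs get-query-results --query-id "$QUERY_ID"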
Common Queries
Find errors:
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
API performance:
fields @timestamp, @duration
| stats avg(@duration), max(@duration), min(@duration)
Count errors by type:
fields @message
| filter @message like /ERROR/
| parse @message /(?<errorType>\w+Error)/
| stats count() by errorType
Slow requests:
fields @timestamp, @duration, @requestId
| filter @duration > 1000
| sort @duration desc
| limit 20
Request rate:
fields @timestamp
| stats count() by bin(5m)
X-Ray Tracing
Enable X-Ray
// infra/api.ts
import { StackContext, Function } from "sst/constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";

export function API({ stack }: StackContext) {
  const api = new Function(stack, "api", {
    handler: "apps/api/src/index.handler",
    tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
  });

  return { api };
}
Instrument Code
// apps/api/src/index.ts
import { captureAWSv3Client } from "aws-xray-sdk-core";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));
View Traces
# Get service graph
aws xray get-service-graph \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s)
# Get trace summaries
aws xray get-trace-summaries \
--start-time $(date -u -d '1 hour ago' +%s) \
--end-time $(date -u +%s)
# Get trace details
aws xray batch-get-traces --trace-ids <trace-id>
Resource Monitoring
Lambda Functions
# List functions
aws lambda list-functions --query 'Functions[?starts_with(FunctionName, `sgcarstrends`)].FunctionName'
# Get function config
aws lambda get-function-configuration \
--function-name sgcarstrends-api-production
# Get function code location
aws lambda get-function \
--function-name sgcarstrends-api-production
# Invoke function
aws lambda invoke \
--function-name sgcarstrends-api-production \
--cli-binary-format raw-in-base64-out \
--payload '{"path": "/health"}' \
response.json
cat response.json
CloudFront Distributions
# List distributions
aws cloudfront list-distributions \
--query 'DistributionList.Items[*].[Id,DomainName,Status]' \
--output table
# Get distribution config
aws cloudfront get-distribution-config --id <distribution-id>
# Create invalidation (cache clear)
aws cloudfront create-invalidation \
--distribution-id <distribution-id> \
--paths "/*"
# List invalidations
aws cloudfront list-invalidations --distribution-id <distribution-id>
S3 Buckets
# List buckets
aws s3 ls
# Get bucket size
aws s3 ls s3://bucket-name --recursive --summarize | grep "Total Size"
# Monitor bucket metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/S3 \
--metric-name BucketSizeBytes \
--dimensions Name=BucketName,Value=bucket-name Name=StorageType,Value=StandardStorage \
--start-time $(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 86400 \
--statistics Average
Cost Monitoring
Cost Explorer
# Get cost and usage
aws ce get-cost-and-usage \
--time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=SERVICE
# Get cost by tag
aws ce get-cost-and-usage \
--time-period Start=$(date -u -d '1 month ago' +%Y-%m-%d),End=$(date -u +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=TAG,Key=Environment
Budget Alerts
Create a budget in the AWS Console or via the CLI; example budget.json and notifications.json files follow the command:
# Create budget
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget file://budget.json \
--notifications-with-subscribers file://notifications.json
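The command above expects two JSON files. A minimal sketch of both, assuming a USD 50 monthly cost budget with an email alert at 80% of actual spend; adjust the budget name, amount, and subscriber address to your account:
# budget.json - a monthly cost budget (values here are illustrative)
cat > budget.json <<'EOF'
{
  "BudgetName": "sgcarstrends-monthly",
  "BudgetLimit": { "Amount": "50", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST"
}
EOF
# notifications.json - email when actual spend crosses 80% of the budget
cat > notifications.json <<'EOF'
[
  {
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [
      { "SubscriptionType": "EMAIL", "Address": "[email protected]" }
    ]
  }
]
EOF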
Debugging Production Issues
1. Check Recent Deployments
# Get stack events
aws cloudformation describe-stack-events \
--stack-name sgcarstrends-api-production \
--max-items 50
# Get deployment status
npx sst stacks info API --stage production
2. Check Logs for Errors
# Get recent errors
npx sst logs --stage production --function api --filter "ERROR" --since 1h
# Or use AWS CLI
aws logs tail "/aws/lambda/sgcarstrends-api-production" \
--follow \
--filter-pattern "ERROR"
3. Check Metrics
# Check invocations and errors
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Sum
4. Test Endpoint
# Test API directly
curl -I https://api.sgcarstrends.com/health
# Test with verbose output
curl -v https://api.sgcarstrends.com/health
5. Check Resource Limits
# Check Lambda quotas
aws service-quotas get-service-quota \
--service-code lambda \
--quota-code L-B99A9384 # Concurrent executions
# Check API Gateway quotas
aws service-quotas list-service-quotas \
--service-code apigateway
Common Issues
High Latency
Investigation:
- Check Lambda duration metrics (a percentile check follows below)
- Review CloudWatch Insights for slow queries
- Check database connection pool
- Review API response times
Solutions:
- Increase Lambda memory
- Optimize database queries
- Add caching
- Use connection pooling
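For the duration check above, averages hide tail latency, so query percentiles as well. Note that percentile statistics must go in --extended-statistics rather than --statistics:
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=sgcarstrends-api-production \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--extended-statistics p50 p95 p99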
High Error Rate
Investigation (an error-rate sketch follows below):
- Check error logs
- Review error types
- Check external service status
- Verify environment variables
Solutions:
- Fix application bugs
- Add error handling
- Retry failed requests
- Check API rate limits
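To judge severity, express errors as a fraction of invocations rather than an absolute count. A rough sketch, assuming the production function name used throughout; a window with no datapoints reports 0%:
#!/bin/bash
FN=sgcarstrends-api-production
START=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
END=$(date -u +%Y-%m-%dT%H:%M:%S)
# Sum a Lambda metric over the window; prints "None" when there are no datapoints
sum_metric() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$1" \
    --dimensions Name=FunctionName,Value=$FN \
    --start-time "$START" --end-time "$END" \
    --period 3600 --statistics Sum \
    --query 'Datapoints[0].Sum' --output text
}
ERRORS=$(sum_metric Errors); [ "$ERRORS" = "None" ] && ERRORS=0
INVOCATIONS=$(sum_metric Invocations); [ "$INVOCATIONS" = "None" ] && INVOCATIONS=0
awk -v e="$ERRORS" -v i="$INVOCATIONS" \
  'BEGIN { printf "Error rate: %.2f%%\n", (i > 0) ? 100 * e / i : 0 }'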
Cold Starts
Investigation:
- Check init duration (an Insights query follows below)
- Review package size
- Check provisioned concurrency
Solutions:
- Enable provisioned concurrency
- Reduce bundle size
- Use ARM architecture
- Optimize imports
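Lambda reports @initDuration on the REPORT log line of every cold start, so CloudWatch Logs Insights can quantify them directly. A sketch using the production log group from earlier examples (fetch results with get-query-results as shown above):
aws logs start-query \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--start-time $(date -u -d '1 day ago' +%s) \
--end-time $(date -u +%s) \
--query-string 'filter @type = "REPORT" and ispresent(@initDuration) | stats count() as coldStarts, avg(@initDuration) as avgInitMs, max(@initDuration) as maxInitMs'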
Monitoring Scripts
Health Check Script
#!/bin/bash
# scripts/health-check.sh
STAGE=${1:-production}

# Production uses the bare domains; other stages get a stage subdomain.
# Adjust if your domain scheme differs.
if [ "$STAGE" = "production" ]; then
  API_URL="https://api.sgcarstrends.com"
  WEB_URL="https://sgcarstrends.com"
else
  API_URL="https://api.$STAGE.sgcarstrends.com"
  WEB_URL="https://$STAGE.sgcarstrends.com"
fi

echo "Checking health of $STAGE environment..."

# Check API
API_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$API_URL/health")
if [ "$API_STATUS" -eq 200 ]; then
  echo "✓ API is healthy"
else
  echo "✗ API is down (status: $API_STATUS)"
  exit 1
fi

# Check Web
WEB_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$WEB_URL")
if [ "$WEB_STATUS" -eq 200 ]; then
  echo "✓ Web is healthy"
else
  echo "✗ Web is down (status: $WEB_STATUS)"
  exit 1
fi

echo "All services are healthy!"
Run:
chmod +x scripts/health-check.sh
./scripts/health-check.sh production
Log Analysis Script
#!/bin/bash
# scripts/analyze-logs.sh
STAGE=${1:-production}
LOG_GROUP="/aws/lambda/sgcarstrends-api-$STAGE"
echo "Analyzing logs for $STAGE..."
# Count errors in last hour
ERROR_COUNT=$(aws logs filter-log-events \
--log-group-name "$LOG_GROUP" \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--query 'events[*].message' \
--output text | wc -l)
echo "Errors in last hour: $ERROR_COUNT"
# Get top errors
echo -e "\nTop error types:"
aws logs filter-log-events \
--log-group-name "$LOG_GROUP" \
--filter-pattern "ERROR" \
--start-time $(date -u -d '1 hour ago' +%s)000 \
--query 'events[*].message' \
--output text | \
grep -oE '\w+Error' | \
sort | uniq -c | sort -rn | head -5
References
- CloudWatch Documentation: https://docs.aws.amazon.com/cloudwatch
- Lambda Monitoring: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-functions.html
- X-Ray: https://docs.aws.amazon.com/xray
- Related files:
  - infra/ - Infrastructure with monitoring config
  - Root CLAUDE.md - Project documentation
Best Practices
- Log Levels: Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
- Structured Logging: Use JSON format for easier parsing
- Correlation IDs: Track requests across services
- Alarms: Set up alarms for critical metrics
- Dashboards: Create CloudWatch dashboards for key metrics
- Cost Monitoring: Track AWS costs regularly
- Regular Reviews: Review logs and metrics weekly
- Retention: Set appropriate log retention (7-30 days); see the commands after this list
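Two of these practices are single commands. A sketch covering the retention and dashboard items; the log group matches earlier examples, while the dashboard name and dashboard.json file are placeholders:
# Set 14-day retention on the production API log group
aws logs put-retention-policy \
--log-group-name "/aws/lambda/sgcarstrends-api-production" \
--retention-in-days 14
# Create or update a dashboard from a JSON definition (dashboard.json not shown)
aws cloudwatch put-dashboard \
--dashboard-name sgcarstrends-production \
--dashboard-body file://dashboard.json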
Quick Install
/plugin add https://github.com/sgcarstrends/sgcarstrends/tree/main/aws-monitoring

Copy and paste this command in Claude Code to install this skill.
