architecture-documenter
About
The architecture-documenter skill analyzes software systems to create comprehensive architecture documentation and technical decision records. It generates system overviews, diagrams, and ADRs to explain components, data flows, and design trade-offs. Use this skill to document architecture for effective team communication, knowledge sharing, and decision tracking.
Documentation
# Architecture Documenter Skill
Document system architecture and technical design decisions for effective team communication and knowledge sharing.
## Instructions
You are a software architecture documentation expert. When invoked:
1. **Analyze System Architecture**:
   - Identify key components and services
   - Understand data flows and interactions
   - Map dependencies and integrations
   - Recognize architectural patterns
   - Assess scalability and reliability
2. **Create Architecture Documentation**:
   - System overview and context
   - Component diagrams and relationships
   - Data flow diagrams
   - Deployment architecture
   - Security architecture
   - Decision records (ADRs)
3. **Document Technical Decisions**:
   - What was decided
   - Why it was decided
   - Alternatives considered
   - Trade-offs made
   - Implementation details
   - Future considerations
4. **Use Visual Diagrams**:
   - System architecture diagrams
   - Sequence diagrams
   - Entity-relationship diagrams
   - Infrastructure diagrams
   - Network topology
   - State machines
5. **Maintain Living Documentation**:
   - Keep docs synchronized with code
   - Version architecture docs
   - Track evolution over time
   - Mark deprecated components
   - Update with lessons learned
## Architecture Documentation Templates
### System Architecture Document Template
# E-Commerce Platform - System Architecture
**Version**: 2.3
**Last Updated**: January 15, 2024
**Status**: Current
**Authors**: Engineering Team
**Reviewers**: Alice (EM), Bob (Tech Lead)
---
## Table of Contents
1. [Executive Summary](#executive-summary)
2. [System Context](#system-context)
3. [Architecture Overview](#architecture-overview)
4. [Core Components](#core-components)
5. [Data Architecture](#data-architecture)
6. [Infrastructure](#infrastructure)
7. [Security Architecture](#security-architecture)
8. [Scalability & Performance](#scalability--performance)
9. [Deployment](#deployment)
10. [Monitoring & Observability](#monitoring--observability)
11. [Future Considerations](#future-considerations)
---
## Executive Summary
### What This System Does
The E-Commerce Platform is a modern, cloud-native application that enables small to medium businesses to sell products online. It handles the complete e-commerce lifecycle from product catalog management to order fulfillment.
### Key Capabilities
- **Product Management**: Create, update, and manage product catalogs
- **Shopping Experience**: Browse products, search, filter, and compare
- **Checkout & Payments**: Secure checkout with multiple payment options
- **Order Management**: Track orders from placement to delivery
- **User Accounts**: Customer profiles, order history, preferences
- **Admin Dashboard**: Business analytics, inventory management
### System Scale
| Metric | Current | Target (6 months) |
|--------|---------|-------------------|
| Active Users | 5,000 businesses | 15,000 businesses |
| Products | 500,000 | 2,000,000 |
| Daily Orders | 10,000 | 50,000 |
| Monthly GMV | $2M | $10M |
| Peak RPS | 500 | 2,000 |
| Data Storage | 2 TB | 10 TB |
### Technology Stack Summary
- **Frontend**: React, TypeScript, Redux, Material-UI
- **Backend**: Node.js, Express, TypeScript
- **Database**: PostgreSQL (primary), Redis (cache)
- **Storage**: AWS S3
- **Hosting**: AWS (ECS, RDS, ElastiCache, CloudFront)
- **CI/CD**: GitHub Actions
- **Monitoring**: DataDog, Sentry
---
## System Context
### Business Context
**Problem We Solve**: Small businesses struggle with expensive, complex e-commerce solutions. Our platform provides an affordable, easy-to-use alternative.
**Target Users**:
- Small business owners (10-1000 products)
- Digital creators selling physical products
- Retail stores expanding online
**Business Model**: SaaS subscription ($29-$299/month) + transaction fees (2.9% + $0.30)
### System Boundary
┌─────────────────────────────────────────────────────┐
│                 E-Commerce Platform                 │
│                                                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────────┐     │
│  │ Customer │   │ Merchant │   │    Admin     │     │
│  │   Web    │   │Dashboard │   │    Portal    │     │
│  └──────────┘   └──────────┘   └──────────────┘     │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │               Backend Services               │   │
│  │    (Auth, Product, Order, Payment, etc.)     │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │             Data & Storage Layer             │   │
│  │           (PostgreSQL, Redis, S3)            │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
      ┌─────────┐    ┌──────────┐    ┌──────────┐
      │ Stripe  │    │ SendGrid │    │  Shippo  │
      │ Payment │    │  Email   │    │ Shipping │
      └─────────┘    └──────────┘    └──────────┘
### External Dependencies
| Service | Purpose | SLA | Fallback Strategy |
|---------|---------|-----|-------------------|
| Stripe | Payment processing | 99.99% | Queue retries, manual processing |
| SendGrid | Email delivery | 99.95% | Alternative provider (AWS SES) |
| Shippo | Shipping labels | 99.9% | Manual label generation |
| AWS | Infrastructure | 99.99% | Multi-AZ deployment |
| Cloudflare | CDN/DNS | 99.99% | Direct origin access |
---
## Architecture Overview
### High-Level Architecture
                      Internet
                          │
                          ▼
                   ┌──────────────┐
                   │  Cloudflare  │ (CDN, DDoS protection)
                   └──────┬───────┘
                          │
                          ▼
               ┌──────────────────────┐
               │    AWS CloudFront    │ (Static assets)
               └──────────────────────┘
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
       ▼                  ▼                  ▼
┌───────────────┐ ┌───────────────┐ ┌──────────────┐
│     React     │ │  API Gateway  │ │    Admin     │
│   Frontend    │ │   (Express)   │ │    Portal    │
│ (CloudFront)  │ │   (ALB+ECS)   │ │              │
└───────────────┘ └───────┬───────┘ └──────────────┘
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
       ▼                  ▼                  ▼
  ┌─────────┐        ┌──────────┐       ┌──────────┐
  │  Auth   │        │ Product  │       │  Order   │
  │ Service │        │ Service  │       │ Service  │
  └────┬────┘        └────┬─────┘       └────┬─────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
         ▼                ▼                ▼
    ┌──────────┐   ┌─────────────┐    ┌──────────┐
    │PostgreSQL│   │    Redis    │    │    S3    │
    │  (RDS)   │   │(ElastiCache)│    │ (Images) │
    └──────────┘   └─────────────┘    └──────────┘
### Architecture Style
**Primary Pattern**: Modular Monolith (transitioning to Microservices)
**Rationale**:
- **Current**: Modular monolith provides simplicity while maintaining clear boundaries
- **Future**: Easy migration path to microservices as scale increases
- **Trade-off**: Accepts coupling cost for development velocity at current scale
### Key Architectural Principles
1. **Separation of Concerns**: Clear boundaries between modules
2. **API-First**: All features exposed via REST APIs
3. **Stateless Services**: No server-side session state (JWT-based auth)
4. **Caching Strategy**: Cache aggressively, invalidate carefully
5. **Eventual Consistency**: Accept eventual consistency for non-critical data
6. **Fail Fast**: Return errors quickly rather than retry indefinitely
7. **Observability**: Comprehensive logging, metrics, and tracing
---
## Core Components
### Frontend Application
**Technology**: React 18 + TypeScript + Redux Toolkit
**Structure**:
client/
├── components/     # Reusable UI components
├── pages/          # Route-level pages
├── store/          # Redux state management
├── api/            # API client
├── hooks/          # Custom React hooks
└── utils/          # Utility functions
**Key Features**:
- Server-side rendering (SSR) for SEO
- Code splitting by route
- Progressive Web App (PWA) capabilities
- Optimistic UI updates
- Offline support (service workers)
**State Management**:
- **Redux**: Global application state
- **React Query**: Server state caching
- **Local Storage**: User preferences, cart (guest users)
**Performance Targets**:
- First Contentful Paint: <1.5s
- Time to Interactive: <3s
- Lighthouse Score: >90
---
### API Gateway
**Technology**: Express.js + TypeScript
**Responsibilities**:
- Request routing
- Authentication/authorization
- Rate limiting
- Request/response transformation
- API versioning
- CORS handling
**Middleware Pipeline**:
```text
Request
↓
Logging (Morgan)
↓
Rate Limiting (express-rate-limit)
↓
CORS (cors)
↓
Authentication (JWT verification)
↓
Authorization (permission check)
↓
Request Validation (Joi)
↓
Route Handler
↓
Response Formatting
↓
Error Handling
↓
Response
```
**API Versioning Strategy**:
- URL versioning: `/api/v1/products`, `/api/v2/products`
- Maintain 2 versions simultaneously
- Deprecation warnings in headers
- 6-month sunset period for old versions
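A minimal sketch of how the deprecation-header and sunset rules above might be computed. The version registry, the deprecation date, and the `deprecationHeaders` helper are hypothetical. `Sunset` is the header defined in RFC 8594; the `Deprecation: true` value is an assumption of this sketch, not something the template specifies.

```javascript
// Hypothetical registry: which API versions exist and when each was deprecated.
const API_VERSIONS = {
  v1: { deprecatedOn: new Date('2024-01-15T00:00:00Z') },
  v2: { deprecatedOn: null }, // current version, not deprecated
};

const SUNSET_MONTHS = 6; // the 6-month sunset period from the list above

// Returns the extra response headers for a request to the given API version.
function deprecationHeaders(version) {
  const entry = API_VERSIONS[version];
  if (!entry) throw new Error(`Unknown API version: ${version}`);
  if (!entry.deprecatedOn) return {};

  const sunset = new Date(entry.deprecatedOn); // copy, do not mutate the registry
  sunset.setUTCMonth(sunset.getUTCMonth() + SUNSET_MONTHS);
  return {
    Deprecation: 'true',          // assumed header value for this sketch
    Sunset: sunset.toUTCString(), // HTTP-date format
  };
}

console.log(deprecationHeaders('v1'));
```

In an Express app, a small middleware could call this helper per request and copy the result onto the response with `res.set(...)`.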
### Service Modules
#### Authentication Service
Responsibilities:
- User registration and login
- JWT token generation and validation
- Password reset flow
- OAuth integration (Google, Facebook)
- Multi-factor authentication (MFA)
Database Schema:
users (
id UUID PRIMARY KEY,
email VARCHAR UNIQUE NOT NULL,
password_hash VARCHAR NOT NULL,
email_verified BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP,
updated_at TIMESTAMP
)
sessions (
id UUID PRIMARY KEY,
user_id UUID REFERENCES users(id),
token_hash VARCHAR NOT NULL,
expires_at TIMESTAMP,
created_at TIMESTAMP
)
oauth_accounts (
id UUID PRIMARY KEY,
user_id UUID REFERENCES users(id),
provider VARCHAR NOT NULL, -- 'google', 'facebook'
provider_user_id VARCHAR NOT NULL,
access_token VARCHAR,
refresh_token VARCHAR,
UNIQUE(provider, provider_user_id)
)
Security Measures:
- Passwords hashed with Argon2id
- JWT tokens with 15-minute expiration
- Refresh tokens with 7-day expiration
- Rate limiting: 5 login attempts per 15 minutes
- Account lockout after 10 failed attempts
- MFA via TOTP (Google Authenticator)
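The rate-limit and lockout rules above can be sketched as follows. This is an illustration under stated assumptions: state is kept in an in-memory `Map` (a shared store such as Redis would be needed across instances), and the function names `checkLoginAllowed` and `recordLoginResult` are hypothetical.

```javascript
const WINDOW_MS = 15 * 60 * 1000;  // 15-minute window
const MAX_ATTEMPTS_PER_WINDOW = 5; // 5 login attempts per window
const LOCKOUT_THRESHOLD = 10;      // lock the account after 10 consecutive failures

// In-memory state keyed by account; an assumption for illustration only.
const loginState = new Map();

function getEntry(key) {
  if (!loginState.has(key)) {
    loginState.set(key, { attempts: [], failures: 0, locked: false });
  }
  return loginState.get(key);
}

// Call before checking credentials. Returns 'ok', 'rate_limited', or 'locked'.
function checkLoginAllowed(key, now = Date.now()) {
  const entry = getEntry(key);
  if (entry.locked) return 'locked';
  // Keep only the attempts that still fall inside the window.
  entry.attempts = entry.attempts.filter((t) => now - t < WINDOW_MS);
  if (entry.attempts.length >= MAX_ATTEMPTS_PER_WINDOW) return 'rate_limited';
  entry.attempts.push(now);
  return 'ok';
}

// Call after checking credentials to track consecutive failures.
function recordLoginResult(key, success) {
  const entry = getEntry(key);
  if (success) {
    entry.failures = 0;
    return;
  }
  entry.failures += 1;
  if (entry.failures >= LOCKOUT_THRESHOLD) entry.locked = true;
}
```

Keeping the rate limit (time based) separate from the lockout (failure-count based) mirrors the two distinct rules in the list.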
#### Product Service
Responsibilities:
- Product CRUD operations
- Inventory management
- Search and filtering
- Product recommendations
- Category management
Database Schema:
products (
id UUID PRIMARY KEY,
merchant_id UUID REFERENCES users(id),
name VARCHAR NOT NULL,
description TEXT,
price DECIMAL(10,2) NOT NULL,
inventory_count INTEGER NOT NULL DEFAULT 0,
category_id UUID REFERENCES categories(id),
status VARCHAR DEFAULT 'draft', -- draft, active, archived
created_at TIMESTAMP,
updated_at TIMESTAMP
)
product_images (
id UUID PRIMARY KEY,
product_id UUID REFERENCES products(id) ON DELETE CASCADE,
url VARCHAR NOT NULL,
position INTEGER,
created_at TIMESTAMP
)
categories (
id UUID PRIMARY KEY,
name VARCHAR NOT NULL,
parent_id UUID REFERENCES categories(id),
slug VARCHAR UNIQUE NOT NULL
)
Search Implementation:
- PostgreSQL full-text search with trigram indexes
- Elasticsearch for advanced features (planned)
- Caching: 5-minute TTL for product lists, 1-hour for individual products
Performance Optimizations:
- Database indexes on common query fields
- N+1 query prevention with eager loading
- Image CDN with automatic resizing
- Aggressive caching with Redis
#### Order Service
Responsibilities:
- Shopping cart management
- Order creation and processing
- Order status tracking
- Order history
- Invoice generation
Database Schema:
orders (
id UUID PRIMARY KEY,
customer_id UUID REFERENCES users(id),
status VARCHAR NOT NULL, -- pending, paid, processing, shipped, delivered, cancelled
subtotal DECIMAL(10,2) NOT NULL,
tax DECIMAL(10,2) NOT NULL,
shipping DECIMAL(10,2) NOT NULL,
total DECIMAL(10,2) NOT NULL,
payment_id VARCHAR,
shipping_address_id UUID REFERENCES addresses(id),
created_at TIMESTAMP,
updated_at TIMESTAMP
)
order_items (
id UUID PRIMARY KEY,
order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
product_id UUID REFERENCES products(id),
quantity INTEGER NOT NULL,
price DECIMAL(10,2) NOT NULL,
product_snapshot JSONB -- Product details at time of purchase
)
order_events (
id UUID PRIMARY KEY,
order_id UUID REFERENCES orders(id),
event_type VARCHAR NOT NULL, -- created, paid, shipped, etc.
metadata JSONB,
created_at TIMESTAMP
)
Order State Machine:
pending → paid → processing → shipped → delivered
↓ ↓ ↓ ↓
└───────┴─────────┴───────────┴──→ cancelled
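The state machine above can be encoded as a transition table so that invalid status changes are rejected in one place. This is a sketch; the `TRANSITIONS` map and the `transition` helper are illustrative names, not part of the template.

```javascript
// Allowed next states for each order status, read off the diagram above.
const TRANSITIONS = {
  pending: ['paid', 'cancelled'],
  paid: ['processing', 'cancelled'],
  processing: ['shipped', 'cancelled'],
  shipped: ['delivered', 'cancelled'],
  delivered: [], // terminal
  cancelled: [], // terminal
};

function canTransition(from, to) {
  return (TRANSITIONS[from] ?? []).includes(to);
}

// Returns a new order object with the updated status, or throws on an invalid move.
function transition(order, to) {
  if (!canTransition(order.status, to)) {
    throw new Error(`Invalid order transition: ${order.status} -> ${to}`);
  }
  return { ...order, status: to };
}

console.log(transition({ id: 'order-1', status: 'pending' }, 'paid').status);
```

Each successful transition is also a natural point to append a row to `order_events`.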
Transaction Handling:
- Database transactions for order creation
- Idempotency keys for payment processing
- Inventory reservation system
- Automatic rollback on payment failure
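To illustrate the idempotency-key item above: the idea is that a retried request carrying the same key returns the stored result instead of charging again. The sketch below assumes an in-memory `Map`; a real system would need a durable store with a unique constraint on the key and would have to handle concurrent requests. `chargeOnce` is a hypothetical helper.

```javascript
// Hypothetical in-memory store of results, keyed by idempotency key.
const processedCharges = new Map();

// Runs `charge` at most once per idempotency key and replays the stored result on retries.
function chargeOnce(idempotencyKey, charge) {
  if (processedCharges.has(idempotencyKey)) {
    return processedCharges.get(idempotencyKey);
  }
  const result = charge();
  processedCharges.set(idempotencyKey, result);
  return result;
}

// Usage: the second call with the same key does not invoke the charge function again.
let chargeCalls = 0;
const doCharge = () => ({ chargeId: `ch_${++chargeCalls}`, amount: 4999 });
const first = chargeOnce('order-123-attempt', doCharge);
const retry = chargeOnce('order-123-attempt', doCharge);
console.log(first.chargeId === retry.chargeId, chargeCalls);
```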
#### Payment Service
Responsibilities:
- Payment intent creation
- Payment processing (via Stripe)
- Refund handling
- Payment method management
- Transaction history
Integration with Stripe:
// Payment Intent Flow
1. Client requests payment intent
↓
2. Server creates Stripe PaymentIntent
↓
3. Client collects payment details
↓
4. Client confirms payment with Stripe
↓
5. Stripe webhook notifies server
↓
6. Server updates order status
Webhook Security:
- Stripe signature verification
- Idempotent webhook processing
- Async processing with job queue
- Retry logic for failed webhooks
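A sketch of what signature verification involves, using only Node's `crypto` module. It assumes Stripe's documented scheme: a `Stripe-Signature` header of the form `t=<timestamp>,v1=<hex signature>`, where the signature is an HMAC-SHA256 of `<timestamp>.<raw body>` keyed with the endpoint secret. In practice the official `stripe` library's `stripe.webhooks.constructEvent` does this for you; `verifyWebhookSignature` and the 300-second tolerance are assumptions of this example.

```javascript
const crypto = require('node:crypto');

const TOLERANCE_SECONDS = 300; // reject events older than 5 minutes (assumed tolerance)

function parseSignatureHeader(header) {
  const parts = { t: null, v1: [] };
  for (const item of header.split(',')) {
    const [key, value] = item.split('=');
    if (key === 't') parts.t = Number(value);
    if (key === 'v1') parts.v1.push(value);
  }
  return parts;
}

// rawBody must be the exact request body string, before any JSON parsing.
function verifyWebhookSignature(rawBody, header, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const { t, v1 } = parseSignatureHeader(header);
  if (!Number.isFinite(t) || v1.length === 0) return false;
  if (Math.abs(nowSeconds - t) > TOLERANCE_SECONDS) return false; // replay protection

  const expected = crypto.createHmac('sha256', secret).update(`${t}.${rawBody}`).digest();
  return v1.some((signature) => {
    const candidate = Buffer.from(signature, 'hex');
    // timingSafeEqual requires equal lengths, so check that first.
    return candidate.length === expected.length && crypto.timingSafeEqual(candidate, expected);
  });
}
```

Verification needs the unparsed body, so the webhook route has to read the raw payload rather than rely on a JSON body parser.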
Database Schema:
payments (
id UUID PRIMARY KEY,
order_id UUID REFERENCES orders(id),
stripe_payment_intent_id VARCHAR UNIQUE,
amount DECIMAL(10,2) NOT NULL,
status VARCHAR NOT NULL, -- pending, succeeded, failed
payment_method VARCHAR, -- card, bank_transfer
metadata JSONB,
created_at TIMESTAMP,
updated_at TIMESTAMP
)
refunds (
id UUID PRIMARY KEY,
payment_id UUID REFERENCES payments(id),
stripe_refund_id VARCHAR UNIQUE,
amount DECIMAL(10,2) NOT NULL,
reason VARCHAR,
status VARCHAR NOT NULL,
created_at TIMESTAMP
)
## Data Architecture
### Database Design
Primary Database: PostgreSQL 14
Schema Organization:
- public schema: Core application tables
- audit schema: Audit logs and event sourcing
- analytics schema: Denormalized data for reporting
Connection Pooling:
{
max: 20, // Max connections
min: 5, // Min connections
idle: 10000, // Close idle connections after 10s
acquire: 30000, // Max time to acquire connection
evict: 1000 // Check for idle connections every 1s
}
Backup Strategy:
- Automated daily backups (RDS snapshots)
- Point-in-time recovery enabled (7-day window)
- Monthly backups retained for 1 year
- Backup tested quarterly
### Caching Strategy
Redis Configuration:
- Deployment: AWS ElastiCache (Redis 7.0)
- Mode: Cluster mode enabled
- Nodes: 3 (primary + 2 replicas)
- Eviction policy: LRU (Least Recently Used)
Cache Patterns:
- Cache-Aside (Read-heavy data):
async function getProduct(id) {
// Try cache first
let product = await cache.get(`product:${id}`);
if (!product) {
// Cache miss - fetch from database
product = await db.products.findById(id);
// Store in cache (1 hour TTL)
await cache.set(`product:${id}`, product, 3600);
}
return product;
}
- Write-Through (Critical data):
async function updateProduct(id, data) {
// Update database
const product = await db.products.update(id, data);
// Update cache
await cache.set(`product:${id}`, product, 3600);
return product;
}
Cache Invalidation:
// Product updated
await cache.del(`product:${productId}`);
await cache.del(`products:merchant:${merchantId}`);
await cache.del(`products:category:${categoryId}`);
// Pattern-based invalidation
await cache.delPattern(`products:*`);
What We Cache:
| Data Type | TTL | Rationale |
|---|---|---|
| Product details | 1 hour | Infrequently updated |
| Product lists | 5 minutes | Frequently updated |
| User sessions | 15 minutes | Security requirement |
| Search results | 10 minutes | Expensive queries |
| API responses | 1 minute | Rate limit protection |
### Data Migration Strategy
Tools: Prisma Migrate (development), custom scripts (production)
Migration Process:
- Create migration in development
- Review SQL in PR
- Test on staging (copy of production data)
- Run on production during low-traffic window
- Rollback plan documented
Zero-Downtime Migrations:
-- Example: Adding non-null column
-- Step 1: Add column as nullable
ALTER TABLE products ADD COLUMN new_field VARCHAR;
-- Step 2: Backfill data (run in batches on large tables to keep locks short)
UPDATE products SET new_field = 'default_value' WHERE new_field IS NULL;
-- Step 3: Add NOT NULL constraint
ALTER TABLE products ALTER COLUMN new_field SET NOT NULL;
## Infrastructure
### AWS Architecture
Regions: Primary: us-east-1, Disaster Recovery: us-west-2
VPC Design:
VPC (10.0.0.0/16)
├── Public Subnets (10.0.1.0/24, 10.0.2.0/24)
│ ├── NAT Gateways
│ └── Application Load Balancer
└── Private Subnets (10.0.10.0/24, 10.0.11.0/24)
├── ECS Tasks (Application)
├── RDS (Database)
└── ElastiCache (Redis)
Compute:
- ECS Fargate: Serverless containers for application
- Auto-scaling: Target CPU 70%, min 2 tasks, max 10 tasks
- Task Definition:
  - CPU: 1024 (1 vCPU)
  - Memory: 2048 MB
  - Container Port: 3000
  - Environment: Production
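As a rough worked example of the auto-scaling settings above (target CPU 70%, min 2 tasks, max 10 tasks): a target-tracking policy scales the task count roughly in proportion to how far the metric is from its target. The formula below is a simplified approximation for illustration, not AWS's actual algorithm, and `desiredTaskCount` is a hypothetical helper.

```javascript
const TARGET_CPU_PERCENT = 70;
const MIN_TASKS = 2;
const MAX_TASKS = 10;

// Proportional estimate: tasks needed so that average CPU lands near the target.
function desiredTaskCount(currentTasks, avgCpuPercent) {
  const raw = Math.ceil((currentTasks * avgCpuPercent) / TARGET_CPU_PERCENT);
  return Math.min(MAX_TASKS, Math.max(MIN_TASKS, raw));
}

console.log(desiredTaskCount(4, 90));  // 4 tasks at 90% CPU -> 6 tasks
console.log(desiredTaskCount(8, 100)); // would need 12, capped at the max of 10
console.log(desiredTaskCount(2, 10));  // would need 1, held at the min of 2
```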
Database:
- RDS PostgreSQL: db.r5.large (2 vCPU, 16 GB RAM)
- Multi-AZ: Yes (automatic failover)
- Read Replicas: 1 (for analytics queries)
- Storage: 500 GB GP3 (auto-scaling enabled)
Storage:
- S3 Bucket: product-images-prod
- Lifecycle Policy: Move to Glacier after 90 days
- CDN: CloudFront distribution for images
- Backup: Cross-region replication enabled
Networking:
- Load Balancer: Application Load Balancer (ALB)
- SSL/TLS: ACM certificates (auto-renewal)
- WAF: AWS WAF with OWASP rules
- DDoS Protection: AWS Shield Standard
### Deployment Architecture
CI/CD Pipeline (GitHub Actions):
Code Push
↓
Automated Tests (Unit + Integration)
↓
Linting & Type Checking
↓
Build Docker Image
↓
Push to ECR (Elastic Container Registry)
↓
Deploy to Staging (Auto)
↓
Integration Tests (Staging)
↓
Manual Approval
↓
Deploy to Production (Canary)
↓
Monitor Metrics (15 minutes)
↓
Full Rollout or Rollback
Deployment Strategy: Blue-Green with Canary
Production (Blue) Canary (Green)
100% traffic → 95% / 5% split → 0% / 100%
↓
Monitor for 15 min
↓
Success? Full rollout : Rollback
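The "Success? Full rollout : Rollback" decision at the end of the monitoring window can be sketched as a comparison of canary metrics against the baseline. All thresholds and field names here are assumptions for illustration; the template does not specify them.

```javascript
// Compares canary metrics against the baseline after the monitoring window.
// Returns 'promote', 'rollback', or 'continue' (not enough canary traffic yet).
function evaluateCanary(baseline, canary, options = {}) {
  const {
    maxErrorRateDelta = 0.01, // canary may exceed baseline error rate by at most 1 point
    maxLatencyRatio = 1.2,    // canary p95 may be at most 20% slower
    minRequests = 100,        // minimum canary sample size
  } = options;

  if (canary.requests < minRequests) return 'continue';

  const baselineErrorRate = baseline.requests > 0 ? baseline.errors / baseline.requests : 0;
  const canaryErrorRate = canary.errors / canary.requests;

  if (canaryErrorRate - baselineErrorRate > maxErrorRateDelta) return 'rollback';
  if (canary.p95LatencyMs > baseline.p95LatencyMs * maxLatencyRatio) return 'rollback';
  return 'promote';
}

const baselineMetrics = { requests: 10000, errors: 50, p95LatencyMs: 200 };
console.log(evaluateCanary(baselineMetrics, { requests: 500, errors: 3, p95LatencyMs: 210 }));
```

Comparing against the live baseline rather than fixed thresholds keeps the check meaningful when overall traffic or latency shifts during the deployment window.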
Rollback Procedure:
- Detect issue (automated alerts or manual)
- Trigger rollback command
- Route traffic back to previous version
- Investigate root cause
- Fix and redeploy
Deployment Windows:
- Staging: Anytime
- Production: Tuesday-Thursday, 10 AM - 2 PM EST
- Emergency: 24/7 with on-call approval
## Security Architecture
### Defense in Depth
Layer 1: Network Security
- VPC isolation
- Security groups (allow-list only)
- Network ACLs
- Private subnets for data layer
- NAT Gateway for outbound traffic
Layer 2: Application Security
- Input validation (all user inputs)
- SQL injection prevention (parameterized queries)
- XSS prevention (sanitization + CSP headers)
- CSRF protection (tokens)
- Rate limiting (DDoS mitigation)
Layer 3: Authentication & Authorization
- JWT with short expiration
- Refresh token rotation
- MFA for admin accounts
- Role-based access control (RBAC)
- Principle of least privilege
Layer 4: Data Security
- Encryption at rest (RDS, S3)
- Encryption in transit (TLS 1.3)
- Secrets in AWS Secrets Manager
- PII data encrypted at field level
- Regular security audits
### Security Headers
{
'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
'Content-Security-Policy': "default-src 'self'; script-src 'self' 'unsafe-inline'",
'X-Frame-Options': 'DENY',
'X-Content-Type-Options': 'nosniff',
'Referrer-Policy': 'strict-origin-when-cross-origin',
'Permissions-Policy': 'geolocation=(), microphone=(), camera=()'
}
### Compliance
Standards:
- PCI DSS: Level 2 (Stripe handles Level 1)
- GDPR: User data rights, deletion, export
- SOC 2 Type II: In progress (Q2 2024)
Data Retention:
- User data: Retained until account deletion
- Order data: 7 years (regulatory requirement)
- Logs: 90 days
- Backups: 1 year
## Scalability & Performance
### Current Capacity
| Metric | Current | Limit | Headroom |
|---|---|---|---|
| Concurrent Users | 500 | 2,000 | 4x |
| Requests/Second | 200 | 1,000 | 5x |
| Database Connections | 50 | 200 | 4x |
| Storage | 500 GB | 2 TB | 4x |
### Scaling Strategy
Horizontal Scaling:
- Stateless services (easy to replicate)
- Auto-scaling based on CPU/memory
- Database read replicas for read-heavy workloads
Vertical Scaling:
- Database instance size (requires scheduled downtime)
- Cache cluster size
Caching:
- Application-level caching (Redis)
- CDN for static assets
- Database query result caching
Database Optimization:
- Indexes on frequently queried fields
- Materialized views for complex queries
- Connection pooling
- Query optimization (EXPLAIN ANALYZE)
### Performance Budgets
API Response Times (p95):
- GET requests: <200ms
- POST requests: <500ms
- Complex queries: <1s
Frontend Performance (Lighthouse):
- Performance: >90
- Accessibility: 100
- Best Practices: >90
- SEO: 100
Database Query Times (p95):
- Simple queries: <50ms
- Join queries: <100ms
- Aggregations: <500ms
## Monitoring & Observability
### Metrics
Application Metrics (DataDog):
- Request rate, error rate, duration (RED metrics)
- Active users, sessions
- Business metrics (orders, revenue)
- Custom metrics (cart abandonment, conversion rate)
Infrastructure Metrics:
- CPU, memory, disk usage
- Network throughput
- Database connections, query performance
- Cache hit rate
Dashboards:
- System Health: Overall system status
- API Performance: Endpoint-specific metrics
- Business Metrics: KPIs and conversions
- Database Performance: Query analysis
- Error Tracking: Error rates and trends
### Logging
Log Levels:
- ERROR: Application errors requiring investigation
- WARN: Potential issues or degraded performance
- INFO: Significant events (order created, payment succeeded)
- DEBUG: Detailed diagnostic information (disabled in production)
Log Aggregation: CloudWatch Logs → DataDog
Structured Logging:
logger.info('Order created', {
orderId: '123',
customerId: '456',
total: 99.99,
timestamp: new Date().toISOString()
});
### Alerting
Alert Channels:
- Critical: PagerDuty (SMS + Phone)
- High: Slack #incidents
- Medium: Slack #engineering
- Low: Email
Alert Rules:
- name: High Error Rate
condition: error_rate > 5% for 5 minutes
severity: CRITICAL
channel: PagerDuty
- name: Slow API Response
condition: p95_latency > 1000ms for 10 minutes
severity: HIGH
channel: Slack
- name: Database Connection Pool Exhausted
condition: db_connections > 180 for 5 minutes
severity: CRITICAL
channel: PagerDuty
- name: Low Cache Hit Rate
condition: cache_hit_rate < 70% for 15 minutes
severity: MEDIUM
channel: Slack
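Each rule above fires only when its condition holds "for N minutes". A sketch of that evaluation over a series of metric samples follows. The sample shape (`{ at, value }`) and the `isFiring` helper are assumptions; monitoring tools such as DataDog evaluate these windows internally.

```javascript
// samples: array of { at: epochMs, value: number }, one per scrape interval.
// Fires only if data covers the whole window and every sample in it breaches.
function isFiring(samples, breaches, forMs, now) {
  const windowStart = now - forMs;
  const recent = samples.filter((s) => s.at >= windowStart && s.at <= now);
  const hasFullWindow = samples.some((s) => s.at <= windowStart);
  return hasFullWindow && recent.length > 0 && recent.every((s) => breaches(s.value));
}

// "High Error Rate": error_rate > 5% for 5 minutes, sampled once per minute.
const MINUTE_MS = 60 * 1000;
const errorRateSamples = [0, 1, 2, 3, 4, 5].map((i) => ({ at: i * MINUTE_MS, value: 0.08 }));
console.log(isFiring(errorRateSamples, (v) => v > 0.05, 5 * MINUTE_MS, 5 * MINUTE_MS));
```

Requiring a sustained breach rather than a single bad sample is what keeps short spikes from paging the on-call engineer.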
### Tracing
Distributed Tracing: DataDog APM
Trace Example:
HTTP Request: GET /api/products/123
├─ Authentication Middleware (5ms)
├─ Authorization Middleware (2ms)
├─ Product Service
│ ├─ Cache Lookup (1ms) [MISS]
│ ├─ Database Query (45ms)
│ └─ Cache Set (2ms)
├─ Response Serialization (3ms)
└─ Total: 58ms
## Future Considerations
### Planned Improvements (Next 6 Months)
1. **Microservices Migration**
   - Extract payment service first
   - Event-driven architecture with message queue
   - Service mesh (Istio) for inter-service communication
2. **Search Enhancement**
   - Migrate to Elasticsearch
   - Implement faceted search
   - Add product recommendations (ML-based)
3. **Performance Optimization**
   - Implement GraphQL (reduce over-fetching)
   - Server-side rendering for better SEO
   - Optimize database queries (20% improvement target)
4. **Infrastructure**
   - Multi-region deployment for lower latency
   - Kubernetes migration (from ECS)
   - Serverless functions for background jobs
### Technical Debt
High Priority:
- Upgrade Node.js from v16 to v20
- Migrate from class components to hooks (React)
- Implement comprehensive integration tests
- Refactor legacy authentication code
Medium Priority:
- Standardize error handling across services
- Improve API documentation (OpenAPI spec)
- Add end-to-end tests for critical flows
Low Priority:
- Migrate from REST to GraphQL
- Implement BFF (Backend for Frontend) pattern
- Add feature flags system
### Risks & Mitigation
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Database becomes bottleneck | HIGH | MEDIUM | Read replicas, caching, sharding plan |
| Monolith difficult to scale | MEDIUM | HIGH | Modular architecture, migration plan |
| Third-party service outage | HIGH | LOW | Fallback strategies, circuit breakers |
| Security breach | CRITICAL | LOW | Regular audits, penetration testing |
| Key engineer departure | MEDIUM | MEDIUM | Documentation, knowledge sharing |
## Appendices
### Glossary
- GMV: Gross Merchandise Value
- RPS: Requests Per Second
- p95: 95th percentile
- TTL: Time To Live
- CDN: Content Delivery Network
- WAF: Web Application Firewall
### Change Log
| Version | Date | Changes | Author |
|---|---|---|---|
| 2.3 | 2024-01-15 | Added canary deployment strategy | Alice |
| 2.2 | 2023-12-01 | Updated infrastructure (ECS migration) | Bob |
| 2.1 | 2023-10-15 | Added security architecture section | Frank |
| 2.0 | 2023-09-01 | Major revision - microservices plan | Alice, Bob |
**Document Status**: Current
**Next Review**: April 15, 2024
**Maintained By**: Engineering Team
**Questions**: #architecture on Slack
### Architecture Decision Record (ADR) Template
# ADR-015: Migrate from Sessions to JWT Authentication
**Status**: Accepted
**Date**: January 15, 2024
**Decision Makers**: Alice (EM), Bob (Tech Lead), Carol (Frontend Lead)
**Consulted**: Security Team, DevOps Team
---
## Context
Our current authentication system uses server-side sessions stored in Redis. As we scale to support more users and prepare for multi-region deployment, session management has become a bottleneck.
### Current State
**Session-Based Authentication**:
```javascript
// Login creates server-side session
app.post('/login', (req, res) => {
  const user = authenticate(req.body);
  req.session.userId = user.id; // Stored in Redis
  res.json({ success: true });
});

// Each request validates session (one Redis lookup per request)
app.use(async (req, res, next) => {
  try {
    if (req.session.userId) {
      req.user = await getUser(req.session.userId);
    }
    next();
  } catch (error) {
    next(error); // Forward lookup failures to the error handler
  }
});
```
Problems:
- Scalability: Every request requires Redis lookup (adds 5-10ms latency)
- Complexity: Session replication across regions is complex
- Memory: 50,000 active sessions = 250MB Redis memory
- Stateful: Cannot easily add new servers (sticky sessions required)
### Requirements
- Stateless: No server-side session storage
- Scalable: Support 50k+ concurrent users
- Secure: Resistant to common attacks (XSS, CSRF, token theft)
- Fast: Minimal performance impact (<1ms overhead)
- Compatible: Work with existing mobile apps
## Decision
We will migrate from session-based authentication to JSON Web Tokens (JWT).
### Implementation
JWT-Based Authentication:
// Login generates JWT
app.post('/login', (req, res) => {
  const user = authenticate(req.body);
  const accessToken = jwt.sign(
    { userId: user.id, role: user.role },
    process.env.JWT_SECRET,
    { algorithm: 'HS256', expiresIn: '15m' }
  );
  const refreshToken = jwt.sign(
    { userId: user.id },
    process.env.REFRESH_SECRET,
    { algorithm: 'HS256', expiresIn: '7d' }
  );
  res.json({ accessToken, refreshToken });
});
// Each request validates JWT (no database lookup)
app.use((req, res, next) => {
  const token = req.headers.authorization?.split(' ')[1];
  if (!token) {
    return res.status(401).json({ error: 'Missing token' });
  }
  try {
    // Pin the algorithm to prevent algorithm confusion attacks
    req.user = jwt.verify(token, process.env.JWT_SECRET, { algorithms: ['HS256'] });
    next();
  } catch (error) {
    res.status(401).json({ error: 'Invalid token' });
  }
});
### Token Structure
Access Token (short-lived):
- Payload: `{ userId, role, permissions }`
- Expiration: 15 minutes
- Signature: HMAC SHA256
Refresh Token (long-lived):
- Payload: `{ userId }`
- Expiration: 7 days
- Stored hash in database (for revocation)
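To make the token structure concrete, here is a sketch that builds and checks an HS256-signed token by hand with Node's `crypto` module: a base64url header, a base64url payload, and an HMAC SHA256 signature over the two. It is for illustration only; the ADR's implementation uses a JWT library, and `signHs256` and `verifyHs256` are hypothetical helpers.

```javascript
const crypto = require('node:crypto');

const b64url = (value) => Buffer.from(value).toString('base64url');

// Builds header.payload.signature with an expiry `ttlSeconds` after `nowSeconds`.
function signHs256(payload, secret, ttlSeconds, nowSeconds) {
  const header = { alg: 'HS256', typ: 'JWT' };
  const claims = { ...payload, iat: nowSeconds, exp: nowSeconds + ttlSeconds };
  const signingInput = `${b64url(JSON.stringify(header))}.${b64url(JSON.stringify(claims))}`;
  const signature = crypto.createHmac('sha256', secret).update(signingInput).digest('base64url');
  return `${signingInput}.${signature}`;
}

// Verifies algorithm, signature, and expiration; returns the claims or throws.
function verifyHs256(token, secret, nowSeconds) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('Malformed token');
  const [encodedHeader, encodedClaims, signature] = parts;

  const header = JSON.parse(Buffer.from(encodedHeader, 'base64url').toString('utf8'));
  if (header.alg !== 'HS256') throw new Error('Unexpected algorithm');

  const expected = crypto.createHmac('sha256', secret)
    .update(`${encodedHeader}.${encodedClaims}`)
    .digest();
  const actual = Buffer.from(signature, 'base64url');
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    throw new Error('Bad signature');
  }

  const claims = JSON.parse(Buffer.from(encodedClaims, 'base64url').toString('utf8'));
  if (claims.exp <= nowSeconds) throw new Error('Token expired');
  return claims;
}

// Access token: 15-minute lifetime (900 seconds).
const accessToken = signHs256({ userId: 'u1', role: 'merchant' }, 'demo-secret', 900, 1000);
console.log(verifyHs256(accessToken, 'demo-secret', 1500).userId);
```

Because verification needs only the shared secret, no store lookup is involved, which is the property the decision relies on.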
## Alternatives Considered
### Alternative 1: Keep Session-Based Auth
Pros:
- No migration needed
- Familiar to team
- Easy to revoke access (delete session)
Cons:
- Scalability issues persist
- Complex multi-region setup
- Requires sticky sessions (load balancer complexity)
Decision: Rejected due to scalability concerns
### Alternative 2: OAuth 2.0 Only
Pros:
- Industry standard
- Delegation capabilities
- Well-tested security
Cons:
- Overkill for our use case
- Complex implementation
- Requires authorization server
- Users expect username/password
Decision: Rejected - too complex for current needs. Will add OAuth as option later.
### Alternative 3: API Keys
Pros:
- Simple implementation
- Stateless
- Easy to revoke
Cons:
- No expiration (security risk)
- Not suitable for user authentication
- No claims/scopes
Decision: Rejected - better suited for programmatic access, not user auth
## Consequences
### Positive
1. **Performance**: Eliminate Redis lookup on every request
   - Estimated improvement: 5-10ms per request
   - Reduces Redis load by 80%
2. **Scalability**: Stateless servers
   - No sticky sessions needed
   - Easy horizontal scaling
   - Multi-region deployment simplified
3. **Mobile Support**: Better mobile app experience
   - Tokens stored locally
   - No cookies required
   - Offline token validation
4. **Developer Experience**: Simpler architecture
   - No session middleware
   - Easier testing (no session state)
   - Clear token lifecycle
### Negative
1. **Token Revocation**: Cannot immediately revoke access
   - Mitigation: Short token expiration (15 min)
   - Mitigation: Refresh token blacklist
   - Mitigation: Emergency: force re-auth for all users
2. **Token Size**: JWTs larger than session IDs
   - Session ID: ~32 bytes
   - JWT: ~200 bytes
   - Impact: Minimal (200 bytes per request is acceptable)
3. **Secret Management**: JWT secrets are critical
   - Mitigation: Store in AWS Secrets Manager
   - Mitigation: Rotate secrets quarterly
   - Mitigation: Different secrets per environment
4. **XSS Risk**: Tokens accessible to JavaScript
   - Mitigation: Store in httpOnly cookies (where possible)
   - Mitigation: Strict Content Security Policy
   - Mitigation: Short token expiration
### Risks
| Risk | Severity | Mitigation |
|---|---|---|
| JWT secret leaked | CRITICAL | Secrets Manager, rotation, monitoring |
| Cannot revoke compromised token | HIGH | Short expiration, refresh token blacklist |
| Algorithm confusion attack | MEDIUM | Explicitly specify algorithm in verification |
| Replay attacks | MEDIUM | Short expiration, HTTPS only |
## Implementation Plan
### Phase 1: Preparation (Week 1-2)
- Create JWT utility functions
- Update authentication middleware
- Add refresh token endpoint
- Write migration guide for frontend team
- Set up secrets in AWS Secrets Manager
### Phase 2: Backend Migration (Week 3-4)
- Deploy JWT endpoints alongside session endpoints
- Add feature flag for JWT authentication
- Comprehensive testing (unit + integration)
- Load testing with JWTs
- Security review
### Phase 3: Frontend Migration (Week 5-6)
- Update web app to use JWT
- Update mobile apps to use JWT
- Gradual rollout (10% → 50% → 100%)
- Monitor error rates and performance
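The gradual rollout (10% → 50% → 100%) behind a feature flag can be sketched with deterministic bucketing, so a given user stays in the same cohort as the percentage grows. Hashing the user ID with SHA-256 is an assumption of this example; `rolloutBucket` and `isJwtEnabled` are hypothetical names.

```javascript
const crypto = require('node:crypto');

// Maps a user ID to a stable bucket in the range 0..99.
function rolloutBucket(userId) {
  const digest = crypto.createHash('sha256').update(String(userId)).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket falls below the current percentage.
function isJwtEnabled(userId, rolloutPercent) {
  return rolloutBucket(userId) < rolloutPercent;
}

const sampleUsers = ['user-1', 'user-2', 'user-3', 'user-4', 'user-5'];
for (const percent of [10, 50, 100]) {
  const enabled = sampleUsers.filter((id) => isJwtEnabled(id, percent));
  console.log(`${percent}% rollout:`, enabled.length, 'of', sampleUsers.length, 'sample users');
}
```

Because the bucket depends only on the user ID, anyone enabled at 10% remains enabled at 50% and 100%, and setting the percentage to 0 acts as the rollback switch.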
### Phase 4: Cleanup (Week 7-8)
- Remove session-based auth code
- Remove Redis session storage
- Update documentation
- Postmortem and lessons learned
## Rollback Plan
If critical issues arise:
- Disable JWT feature flag
- Route all traffic to session endpoints
- Keep JWT code for investigation
- Identify and fix issues
- Resume migration
## Metrics for Success
Performance:
- Average request latency reduced by 5ms
- p95 latency reduced by 10ms
- Redis CPU usage reduced by 80%
Reliability:
- No increase in authentication errors
- <0.1% token validation failures
- Zero security incidents
User Experience:
- Login flow unchanged (transparent migration)
- No increase in support tickets
- Mobile app performance improved
## Security Considerations
Token Security:
- Tokens signed with HS256 (HMAC SHA256)
- Secrets: 256-bit randomly generated
- Secrets rotated quarterly
- Algorithm specified in verification (prevent algorithm confusion)
Storage:
- Web: httpOnly cookies (prevents XSS)
- Mobile: Secure storage (Keychain/Keystore)
- Never in localStorage (XSS vulnerable)
Transmission:
- HTTPS only (TLS 1.3)
- Secure, SameSite=Strict cookies
- No tokens in URLs (log exposure)
Validation:
- Verify signature
- Check expiration
- Validate issuer and audience
- Check token not blacklisted (refresh tokens)
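The claim checks in the validation list (expiration, issuer, audience, blacklist) can be sketched as a function applied after the signature has been verified. The claim names follow the registered JWT claims (`iss`, `aud`, `exp`, `jti`); the `validateClaims` helper, the issuer and audience values, and the in-memory revocation set are assumptions of this example.

```javascript
// Validates decoded claims after signature verification has already succeeded.
function validateClaims(claims, { issuer, audience, nowSeconds, isRevoked = () => false }) {
  if (claims.iss !== issuer) return { ok: false, reason: 'issuer' };

  // `aud` may be a single string or an array of strings.
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (!audiences.includes(audience)) return { ok: false, reason: 'audience' };

  if (typeof claims.exp !== 'number' || claims.exp <= nowSeconds) {
    return { ok: false, reason: 'expired' };
  }
  if (claims.jti && isRevoked(claims.jti)) return { ok: false, reason: 'revoked' };
  return { ok: true };
}

// Hypothetical blacklist of revoked refresh token IDs.
const revokedTokenIds = new Set(['refresh-42']);
const checkOptions = {
  issuer: 'https://auth.example.com',
  audience: 'ecommerce-api',
  nowSeconds: 1000,
  isRevoked: (jti) => revokedTokenIds.has(jti),
};

console.log(validateClaims(
  { iss: 'https://auth.example.com', aud: 'ecommerce-api', exp: 2000, jti: 'refresh-7' },
  checkOptions
));
```

JWT libraries typically offer these checks as verification options, so a hand-written version like this is mainly useful for the blacklist lookup.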
## Updates
| Date | Update | Author |
|---|---|---|
| 2024-01-15 | Initial ADR created | Bob |
| 2024-01-20 | Added security review feedback | Frank |
| 2024-02-01 | Updated after implementation | Bob |
**Status**: Accepted
**Supersedes**: ADR-008 (Session-based Authentication)
**Related**: ADR-012 (API Security), ADR-014 (Multi-region Deployment)
## Usage Examples
@architecture-documenter
@architecture-documenter --type system-overview
@architecture-documenter --type adr
@architecture-documenter --focus security
@architecture-documenter --focus scalability
@architecture-documenter --include-diagrams
@architecture-documenter --update-existing
## Best Practices
### Document Architecture Decisions
**When to create an ADR**:
- Significant technical decisions
- Architecture changes
- Technology choices
- Process changes
- Security decisions
**ADR Structure**:
1. **Context**: What's the situation?
2. **Decision**: What did we decide?
3. **Alternatives**: What else did we consider?
4. **Consequences**: What are the impacts?
### Use Visual Diagrams
**Diagram Types**:
- **System Context**: Show system boundaries
- **Container**: Show high-level architecture
- **Component**: Show internal structure
- **Code**: Show class/module relationships
- **Deployment**: Show infrastructure
- **Sequence**: Show interactions over time
**Tools**:
- Diagrams as code: Mermaid, PlantUML
- Visual tools: Lucidchart, Draw.io
- Cloud-specific: AWS Architecture Diagrams
### Keep Documentation Current
**Documentation Lifecycle**:
- Create during design phase
- Review in code review
- Update with implementation changes
- Quarterly architecture review
- Archive outdated docs (don't delete)
**Version Control**:
- Store docs with code
- Version alongside releases
- Link docs to specific code versions
- Maintain changelog
### Make It Discoverable
**Organization**:
- Central location (wiki, docs folder)
- Clear naming conventions
- Table of contents
- Cross-references
- Search-friendly
**Accessibility**:
- Public within organization
- Easy to navigate
- Multiple entry points
- Links from README
## Notes
- Architecture documentation is for communication, not perfection
- Diagrams speak louder than words - use them liberally
- ADRs capture decisions and context for future reference
- Keep docs synchronized with code changes
- Version architecture docs alongside code
- Regular reviews prevent documentation drift
- Good architecture docs reduce onboarding time significantly
- Document the "why" not just the "what"
- Include trade-offs and alternatives considered
- Make security and scalability explicit
- Link architecture to business goals
- Use consistent notation and terminology
Quick Install
/plugin add https://github.com/CuriousLearner/devkit/tree/main/architecture-documenter
Copy and paste this command in Claude Code to install this skill.