architecture-documenter

CuriousLearner

Updated Today

19 views

Designworddesign

About

The architecture-documenter skill analyzes software systems to create comprehensive architecture documentation and technical decision records. It generates system overviews, diagrams, and ADRs to explain components, data flows, and design trade-offs. Use this skill to document architecture for effective team communication, knowledge sharing, and decision tracking.

Documentation

Architecture Documenter Skill

Document system architecture and technical design decisions for effective team communication and knowledge sharing.

Instructions

You are a software architecture documentation expert. When invoked:

Analyze System Architecture:
- Identify key components and services
- Understand data flows and interactions
- Map dependencies and integrations
- Recognize architectural patterns
- Assess scalability and reliability
Create Architecture Documentation:
- System overview and context
- Component diagrams and relationships
- Data flow diagrams
- Deployment architecture
- Security architecture
- Decision records (ADRs)
Document Technical Decisions:
- What was decided
- Why it was decided
- Alternatives considered
- Trade-offs made
- Implementation details
- Future considerations
Use Visual Diagrams:
- System architecture diagrams
- Sequence diagrams
- Entity-relationship diagrams
- Infrastructure diagrams
- Network topology
- State machines
Maintain Living Documentation:
- Keep docs synchronized with code
- Version architecture docs
- Track evolution over time
- Mark deprecated components
- Update with lessons learned

Architecture Documentation Templates

System Architecture Document Template

# E-Commerce Platform - System Architecture

**Version**: 2.3
**Last Updated**: January 15, 2024
**Status**: Current
**Authors**: Engineering Team
**Reviewers**: Alice (EM), Bob (Tech Lead)

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [System Context](#system-context)
3. [Architecture Overview](#architecture-overview)
4. [Core Components](#core-components)
5. [Data Architecture](#data-architecture)
6. [Infrastructure](#infrastructure)
7. [Security Architecture](#security-architecture)
8. [Scalability & Performance](#scalability--performance)
9. [Deployment](#deployment)
10. [Monitoring & Observability](#monitoring--observability)
11. [Future Considerations](#future-considerations)

---

## Executive Summary

### What This System Does

The E-Commerce Platform is a modern, cloud-native application that enables small to medium businesses to sell products online. It handles the complete e-commerce lifecycle from product catalog management to order fulfillment.

### Key Capabilities

- **Product Management**: Create, update, and manage product catalogs
- **Shopping Experience**: Browse products, search, filter, and compare
- **Checkout & Payments**: Secure checkout with multiple payment options
- **Order Management**: Track orders from placement to delivery
- **User Accounts**: Customer profiles, order history, preferences
- **Admin Dashboard**: Business analytics, inventory management

### System Scale

| Metric | Current | Target (6 months) |
|--------|---------|-------------------|
| Active Users | 5,000 businesses | 15,000 businesses |
| Products | 500,000 | 2,000,000 |
| Daily Orders | 10,000 | 50,000 |
| Monthly GMV | $2M | $10M |
| Peak RPS | 500 | 2,000 |
| Data Storage | 2 TB | 10 TB |

### Technology Stack Summary

- **Frontend**: React, TypeScript, Redux, Material-UI
- **Backend**: Node.js, Express, TypeScript
- **Database**: PostgreSQL (primary), Redis (cache)
- **Storage**: AWS S3
- **Hosting**: AWS (ECS, RDS, ElastiCache, CloudFront)
- **CI/CD**: GitHub Actions
- **Monitoring**: DataDog, Sentry

---

## System Context

### Business Context

**Problem We Solve**: Small businesses struggle with expensive, complex e-commerce solutions. Our platform provides an affordable, easy-to-use alternative.

**Target Users**:
- Small business owners (10-1000 products)
- Digital creators selling physical products
- Retail stores expanding online

**Business Model**: SaaS subscription ($29-$299/month) + transaction fees (2.9% + $0.30)

### System Boundary

┌─────────────────────────────────────────────────────┐ │ E-Commerce Platform │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │ │ │ Customer │ │ Merchant │ │ Admin │ │ │ │ Web │ │Dashboard │ │ Portal │ │ │ └──────────┘ └──────────┘ └──────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ Backend Services │ │ │ │ (Auth, Product, Order, Payment, etc.) │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ Data & Storage Layer │ │ │ │ (PostgreSQL, Redis, S3) │ │ │ └──────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────┘ │ ┌───────────────┼───────────────┐ │ │ │ ┌─────────┐ ┌──────────┐ ┌──────────┐ │ Stripe │ │ SendGrid│ │ Shippo │ │ Payment │ │ Email │ │ Shipping │ └─────────┘ └──────────┘ └──────────┘


### External Dependencies

| Service | Purpose | SLA | Fallback Strategy |
|---------|---------|-----|-------------------|
| Stripe | Payment processing | 99.99% | Queue retries, manual processing |
| SendGrid | Email delivery | 99.95% | Alternative provider (AWS SES) |
| Shippo | Shipping labels | 99.9% | Manual label generation |
| AWS | Infrastructure | 99.99% | Multi-AZ deployment |
| Cloudflare | CDN/DNS | 99.99% | Direct origin access |

---

## Architecture Overview

### High-Level Architecture

                    Internet
                       │
                       ▼
                ┌──────────────┐
                │  Cloudflare  │ (CDN, DDoS protection)
                └──────┬───────┘
                       │
                       ▼
            ┌──────────────────────┐
            │   AWS CloudFront     │ (Static assets)
            └──────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    │                  │                  │
    ▼                  ▼                  ▼

┌───────────────┐ ┌───────────────┐ ┌──────────────┐ │ React │ │ API Gateway │ │ Admin │ │ Frontend │ │ (Express) │ │ Portal │ │ (CloudFront) │ │ (ALB+ECS) │ │ │ └───────────────┘ └───────┬───────┘ └──────────────┘ │ ┌──────────────────┼──────────────────┐ │ │ │ ▼ ▼ ▼ ┌─────────┐ ┌──────────┐ ┌──────────┐ │ Auth │ │ Product │ │ Order │ │ Service │ │ Service │ │ Service │ └────┬────┘ └────┬─────┘ └────┬─────┘ │ │ │ └────────────────┼──────────────────┘ │ ┌────────────────┼────────────────┐ │ │ │ ▼ ▼ ▼ ┌─────────┐ ┌─────────┐ ┌─────────┐ │PostgreSQL│ │ Redis │ │ S3 │ │ (RDS) │ │(ElastiCache) │(Images) │ └──────────┘ └─────────┘ └─────────┘


### Architecture Style

**Primary Pattern**: Modular Monolith (transitioning to Microservices)

**Rationale**:
- **Current**: Modular monolith provides simplicity while maintaining clear boundaries
- **Future**: Easy migration path to microservices as scale increases
- **Trade-off**: Accepts coupling cost for development velocity at current scale

### Key Architectural Principles

1. **Separation of Concerns**: Clear boundaries between modules
2. **API-First**: All features exposed via REST APIs
3. **Stateless Services**: No server-side session state (JWT-based auth)
4. **Caching Strategy**: Cache aggressively, invalidate carefully
5. **Eventual Consistency**: Accept eventual consistency for non-critical data
6. **Fail Fast**: Return errors quickly rather than retry indefinitely
7. **Observability**: Comprehensive logging, metrics, and tracing

---

## Core Components

### Frontend Application

**Technology**: React 18 + TypeScript + Redux Toolkit

**Structure**:

client/ ├── components/ # Reusable UI components ├── pages/ # Route-level pages ├── store/ # Redux state management ├── api/ # API client ├── hooks/ # Custom React hooks └── utils/ # Utility functions


**Key Features**:
- Server-side rendering (SSR) for SEO
- Code splitting by route
- Progressive Web App (PWA) capabilities
- Optimistic UI updates
- Offline support (service workers)

**State Management**:
- **Redux**: Global application state
- **React Query**: Server state caching
- **Local Storage**: User preferences, cart (guest users)

**Performance Targets**:
- First Contentful Paint: <1.5s
- Time to Interactive: <3s
- Lighthouse Score: >90

---

### API Gateway

**Technology**: Express.js + TypeScript

**Responsibilities**:
- Request routing
- Authentication/authorization
- Rate limiting
- Request/response transformation
- API versioning
- CORS handling

**Middleware Pipeline**:
```javascript
Request
  ↓
Logging (Morgan)
  ↓
Rate Limiting (express-rate-limit)
  ↓
CORS (cors)
  ↓
Authentication (JWT verification)
  ↓
Authorization (permission check)
  ↓
Request Validation (Joi)
  ↓
Route Handler
  ↓
Response Formatting
  ↓
Error Handling
  ↓
Response

API Versioning Strategy:

URL versioning: /api/v1/products, /api/v2/products
Maintain 2 versions simultaneously
Deprecation warnings in headers
6-month sunset period for old versions

Service Modules

Authentication Service

Responsibilities:

User registration and login
JWT token generation and validation
Password reset flow
OAuth integration (Google, Facebook)
Multi-factor authentication (MFA)

Database Schema:

users (
  id UUID PRIMARY KEY,
  email VARCHAR UNIQUE NOT NULL,
  password_hash VARCHAR NOT NULL,
  email_verified BOOLEAN DEFAULT FALSE,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

sessions (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  token_hash VARCHAR NOT NULL,
  expires_at TIMESTAMP,
  created_at TIMESTAMP
)

oauth_accounts (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  provider VARCHAR NOT NULL, -- 'google', 'facebook'
  provider_user_id VARCHAR NOT NULL,
  access_token VARCHAR,
  refresh_token VARCHAR,
  UNIQUE(provider, provider_user_id)
)

Security Measures:

Passwords hashed with Argon2id
JWT tokens with 15-minute expiration
Refresh tokens with 7-day expiration
Rate limiting: 5 login attempts per 15 minutes
Account lockout after 10 failed attempts
MFA via TOTP (Google Authenticator)

Product Service

Responsibilities:

Product CRUD operations
Inventory management
Search and filtering
Product recommendations
Category management

Database Schema:

products (
  id UUID PRIMARY KEY,
  merchant_id UUID REFERENCES users(id),
  name VARCHAR NOT NULL,
  description TEXT,
  price DECIMAL(10,2) NOT NULL,
  inventory_count INTEGER NOT NULL DEFAULT 0,
  category_id UUID REFERENCES categories(id),
  status VARCHAR DEFAULT 'draft', -- draft, active, archived
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

product_images (
  id UUID PRIMARY KEY,
  product_id UUID REFERENCES products(id) ON DELETE CASCADE,
  url VARCHAR NOT NULL,
  position INTEGER,
  created_at TIMESTAMP
)

categories (
  id UUID PRIMARY KEY,
  name VARCHAR NOT NULL,
  parent_id UUID REFERENCES categories(id),
  slug VARCHAR UNIQUE NOT NULL
)

Search Implementation:

PostgreSQL full-text search with trigram indexes
Elasticsearch for advanced features (planned)
Caching: 5-minute TTL for product lists, 1-hour for individual products

Performance Optimizations:

Database indexes on common query fields
N+1 query prevention with eager loading
Image CDN with automatic resizing
Aggressive caching with Redis

Order Service

Responsibilities:

Shopping cart management
Order creation and processing
Order status tracking
Order history
Invoice generation

Database Schema:

orders (
  id UUID PRIMARY KEY,
  customer_id UUID REFERENCES users(id),
  status VARCHAR NOT NULL, -- pending, paid, shipped, delivered, cancelled
  subtotal DECIMAL(10,2) NOT NULL,
  tax DECIMAL(10,2) NOT NULL,
  shipping DECIMAL(10,2) NOT NULL,
  total DECIMAL(10,2) NOT NULL,
  payment_id VARCHAR,
  shipping_address_id UUID REFERENCES addresses(id),
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

order_items (
  id UUID PRIMARY KEY,
  order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
  product_id UUID REFERENCES products(id),
  quantity INTEGER NOT NULL,
  price DECIMAL(10,2) NOT NULL,
  product_snapshot JSONB -- Product details at time of purchase
)

order_events (
  id UUID PRIMARY KEY,
  order_id UUID REFERENCES orders(id),
  event_type VARCHAR NOT NULL, -- created, paid, shipped, etc.
  metadata JSONB,
  created_at TIMESTAMP
)

Order State Machine:

pending → paid → processing → shipped → delivered
   ↓       ↓         ↓           ↓
   └───────┴─────────┴───────────┴──→ cancelled

Transaction Handling:

Database transactions for order creation
Idempotency keys for payment processing
Inventory reservation system
Automatic rollback on payment failure

Payment Service

Responsibilities:

Payment intent creation
Payment processing (via Stripe)
Refund handling
Payment method management
Transaction history

Integration with Stripe:

// Payment Intent Flow
1. Client requests payment intent
   ↓
2. Server creates Stripe PaymentIntent
   ↓
3. Client collects payment details
   ↓
4. Client confirms payment with Stripe
   ↓
5. Stripe webhook notifies server
   ↓
6. Server updates order status

Webhook Security:

Stripe signature verification
Idempotent webhook processing
Async processing with job queue
Retry logic for failed webhooks

Database Schema:

payments (
  id UUID PRIMARY KEY,
  order_id UUID REFERENCES orders(id),
  stripe_payment_intent_id VARCHAR UNIQUE,
  amount DECIMAL(10,2) NOT NULL,
  status VARCHAR NOT NULL, -- pending, succeeded, failed
  payment_method VARCHAR, -- card, bank_transfer
  metadata JSONB,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

refunds (
  id UUID PRIMARY KEY,
  payment_id UUID REFERENCES payments(id),
  stripe_refund_id VARCHAR UNIQUE,
  amount DECIMAL(10,2) NOT NULL,
  reason VARCHAR,
  status VARCHAR NOT NULL,
  created_at TIMESTAMP
)

Data Architecture

Database Design

Primary Database: PostgreSQL 14

Schema Organization:

public schema: Core application tables
audit schema: Audit logs and event sourcing
analytics schema: Denormalized data for reporting

Connection Pooling:

{
  max: 20,              // Max connections
  min: 5,               // Min connections
  idle: 10000,          // Close idle connections after 10s
  acquire: 30000,       // Max time to acquire connection
  evict: 1000           // Check for idle connections every 1s
}

Backup Strategy:

Automated daily backups (RDS snapshots)
Point-in-time recovery enabled (7-day window)
Monthly backups retained for 1 year
Backup tested quarterly

Caching Strategy

Redis Configuration:

Deployment: AWS ElastiCache (Redis 7.0)
Mode: Cluster mode enabled
Nodes: 3 (primary + 2 replicas)
Eviction policy: LRU (Least Recently Used)

Cache Patterns:

Cache-Aside (Read-heavy data):

async function getProduct(id) {
  // Try cache first
  let product = await cache.get(`product:${id}`);

  if (!product) {
    // Cache miss - fetch from database
    product = await db.products.findById(id);

    // Store in cache (1 hour TTL)
    await cache.set(`product:${id}`, product, 3600);
  }

  return product;
}

Write-Through (Critical data):

async function updateProduct(id, data) {
  // Update database
  const product = await db.products.update(id, data);

  // Update cache
  await cache.set(`product:${id}`, product, 3600);

  return product;
}

Cache Invalidation:

// Product updated
await cache.del(`product:${productId}`);
await cache.del(`products:merchant:${merchantId}`);
await cache.del(`products:category:${categoryId}`);

// Pattern-based invalidation
await cache.delPattern(`products:*`);

What We Cache:

Data Type	TTL	Rationale
Product details	1 hour	Infrequently updated
Product lists	5 minutes	Frequently updated
User sessions	15 minutes	Security requirement
Search results	10 minutes	Expensive queries
API responses	1 minute	Rate limit protection

Data Migration Strategy

Tools: Prisma Migrate (development), custom scripts (production)

Migration Process:

Create migration in development
Review SQL in PR
Test on staging (copy of production data)
Run on production during low-traffic window
Rollback plan documented

Zero-Downtime Migrations:

-- Example: Adding non-null column

-- Step 1: Add column as nullable
ALTER TABLE products ADD COLUMN new_field VARCHAR;

-- Step 2: Backfill data
UPDATE products SET new_field = 'default_value' WHERE new_field IS NULL;

-- Step 3: Add NOT NULL constraint
ALTER TABLE products ALTER COLUMN new_field SET NOT NULL;

Infrastructure

AWS Architecture

Regions: Primary: us-east-1, Disaster Recovery: us-west-2

VPC Design:

VPC (10.0.0.0/16)
├── Public Subnets (10.0.1.0/24, 10.0.2.0/24)
│   ├── NAT Gateways
│   └── Application Load Balancer
└── Private Subnets (10.0.10.0/24, 10.0.11.0/24)
    ├── ECS Tasks (Application)
    ├── RDS (Database)
    └── ElastiCache (Redis)

Compute:

ECS Fargate: Serverless containers for application
Auto-scaling: Target CPU 70%, min 2 tasks, max 10 tasks

Task Definition:

CPU: 1024 (1 vCPU)
Memory: 2048 MB
Container Port: 3000
Environment: Production

Database:

RDS PostgreSQL: db.r5.large (2 vCPU, 16 GB RAM)
Multi-AZ: Yes (automatic failover)
Read Replicas: 1 (for analytics queries)
Storage: 500 GB GP3 (auto-scaling enabled)

Storage:

S3 Bucket: product-images-prod
Lifecycle Policy: Move to Glacier after 90 days
CDN: CloudFront distribution for images
Backup: Cross-region replication enabled

Networking:

Load Balancer: Application Load Balancer (ALB)
SSL/TLS: ACM certificates (auto-renewal)
WAF: AWS WAF with OWASP rules
DDoS Protection: AWS Shield Standard

Deployment Architecture

CI/CD Pipeline (GitHub Actions):

Code Push
  ↓
Automated Tests (Unit + Integration)
  ↓
Linting & Type Checking
  ↓
Build Docker Image
  ↓
Push to ECR (Elastic Container Registry)
  ↓
Deploy to Staging (Auto)
  ↓
Integration Tests (Staging)
  ↓
Manual Approval
  ↓
Deploy to Production (Canary)
  ↓
Monitor Metrics (15 minutes)
  ↓
Full Rollout or Rollback

Deployment Strategy: Blue-Green with Canary

Production (Blue)        Canary (Green)
100% traffic    →   95% / 5% split   →   0% / 100%
                    ↓
              Monitor for 15 min
                    ↓
              Success? Full rollout : Rollback

Rollback Procedure:

Detect issue (automated alerts or manual)
Trigger rollback command
Route traffic back to previous version
Investigate root cause
Fix and redeploy

Deployment Windows:

Staging: Anytime
Production: Tuesday-Thursday, 10 AM - 2 PM EST
Emergency: 24/7 with on-call approval

Security Architecture

Defense in Depth

Layer 1: Network Security

VPC isolation
Security groups (allow-list only)
Network ACLs
Private subnets for data layer
NAT Gateway for outbound traffic

Layer 2: Application Security

Input validation (all user inputs)
SQL injection prevention (parameterized queries)
XSS prevention (sanitization + CSP headers)
CSRF protection (tokens)
Rate limiting (DDoS mitigation)

Layer 3: Authentication & Authorization

JWT with short expiration
Refresh token rotation
MFA for admin accounts
Role-based access control (RBAC)
Principle of least privilege

Layer 4: Data Security

Encryption at rest (RDS, S3)
Encryption in transit (TLS 1.3)
Secrets in AWS Secrets Manager
PII data encrypted at field level
Regular security audits

Security Headers

{
  'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
  'Content-Security-Policy': "default-src 'self'; script-src 'self' 'unsafe-inline'",
  'X-Frame-Options': 'DENY',
  'X-Content-Type-Options': 'nosniff',
  'Referrer-Policy': 'strict-origin-when-cross-origin',
  'Permissions-Policy': 'geolocation=(), microphone=(), camera=()'
}

Compliance

Standards:

PCI DSS: Level 2 (Stripe handles Level 1)
GDPR: User data rights, deletion, export
SOC 2 Type II: In progress (Q2 2024)

Data Retention:

User data: Retained until account deletion
Order data: 7 years (regulatory requirement)
Logs: 90 days
Backups: 1 year

Scalability & Performance

Current Capacity

Metric	Current	Limit	Headroom
Concurrent Users	500	2,000	4x
Requests/Second	200	1,000	5x
Database Connections	50	200	4x
Storage	500 GB	2 TB	4x

Scaling Strategy

Horizontal Scaling:

Stateless services (easy to replicate)
Auto-scaling based on CPU/memory
Database read replicas for read-heavy workloads

Vertical Scaling:

Database instance size (scheduled uptime)
Cache cluster size

Caching:

Application-level caching (Redis)
CDN for static assets
Database query result caching

Database Optimization:

Indexes on frequently queried fields
Materialized views for complex queries
Connection pooling
Query optimization (EXPLAIN ANALYZE)

Performance Budgets

API Response Times (p95):

GET requests: <200ms
POST requests: <500ms
Complex queries: <1s

Frontend Performance (Lighthouse):

Performance: >90
Accessibility: 100
Best Practices: >90
SEO: 100

Database Query Times (p95):

Simple queries: <50ms
Join queries: <100ms
Aggregations: <500ms

Monitoring & Observability

Metrics

Application Metrics (DataDog):

Request rate, error rate, duration (RED metrics)
Active users, sessions
Business metrics (orders, revenue)
Custom metrics (cart abandonment, conversion rate)

Infrastructure Metrics:

CPU, memory, disk usage
Network throughput
Database connections, query performance
Cache hit rate

Dashboards:

System Health: Overall system status
API Performance: Endpoint-specific metrics
Business Metrics: KPIs and conversions
Database Performance: Query analysis
Error Tracking: Error rates and trends

Logging

Log Levels:

ERROR: Application errors requiring investigation
WARN: Potential issues or degraded performance
INFO: Significant events (order created, payment succeeded)
DEBUG: Detailed diagnostic information (disabled in production)

Log Aggregation: CloudWatch Logs → DataDog

Structured Logging:

logger.info('Order created', {
  orderId: '123',
  customerId: '456',
  total: 99.99,
  timestamp: new Date().toISOString()
});

Alerting

Alert Channels:

Critical: PagerDuty (SMS + Phone)
High: Slack #incidents
Medium: Slack #engineering
Low: Email

Alert Rules:

- name: High Error Rate
  condition: error_rate > 5% for 5 minutes
  severity: CRITICAL
  channel: PagerDuty

- name: Slow API Response
  condition: p95_latency > 1000ms for 10 minutes
  severity: HIGH
  channel: Slack

- name: Database Connection Pool Exhausted
  condition: db_connections > 180 for 5 minutes
  severity: CRITICAL
  channel: PagerDuty

- name: Low Cache Hit Rate
  condition: cache_hit_rate < 70% for 15 minutes
  severity: MEDIUM
  channel: Slack

Tracing

Distributed Tracing: DataDog APM

Trace Example:

HTTP Request: GET /api/products/123
├─ Authentication Middleware (5ms)
├─ Authorization Middleware (2ms)
├─ Product Service
│  ├─ Cache Lookup (1ms) [MISS]
│  ├─ Database Query (45ms)
│  └─ Cache Set (2ms)
├─ Response Serialization (3ms)
└─ Total: 58ms

Future Considerations

Planned Improvements (Next 6 Months)

Microservices Migration
- Extract payment service first
- Event-driven architecture with message queue
- Service mesh (Istio) for inter-service communication
Search Enhancement
- Migrate to Elasticsearch
- Implement faceted search
- Add product recommendations (ML-based)
Performance Optimization
- Implement GraphQL (reduce over-fetching)
- Server-side rendering for better SEO
- Optimize database queries (20% improvement target)
Infrastructure
- Multi-region deployment for lower latency
- Kubernetes migration (from ECS)
- Serverless functions for background jobs

Technical Debt

High Priority:

Upgrade Node.js from v16 to v20
Migrate from class components to hooks (React)
Implement comprehensive integration tests
Refactor legacy authentication code

Medium Priority:

Standardize error handling across services
Improve API documentation (OpenAPI spec)
Add end-to-end tests for critical flows

Low Priority:

Migrate from REST to GraphQL
Implement BFF (Backend for Frontend) pattern
Add feature flags system

Risks & Mitigation

Risk	Impact	Probability	Mitigation
Database becomes bottleneck	HIGH	MEDIUM	Read replicas, caching, sharding plan
Monolith difficult to scale	MEDIUM	HIGH	Modular architecture, migration plan
Third-party service outage	HIGH	LOW	Fallback strategies, circuit breakers
Security breach	CRITICAL	LOW	Regular audits, penetration testing
Key engineer departure	MEDIUM	MEDIUM	Documentation, knowledge sharing

Appendices

Glossary

GMV: Gross Merchandise Value
RPS: Requests Per Second
p95: 95th percentile
TTL: Time To Live
CDN: Content Delivery Network
WAF: Web Application Firewall

References

Change Log

Version	Date	Changes	Author
2.3	2024-01-15	Added canary deployment strategy	Alice
2.2	2023-12-01	Updated infrastructure (ECS migration)	Bob
2.1	2023-10-15	Added security architecture section	Frank
2.0	2023-09-01	Major revision - microservices plan	Alice, Bob

Document Status: Current Next Review: April 15, 2024 Maintained By: Engineering Team Questions: #architecture on Slack


### Architecture Decision Record (ADR) Template

```markdown
# ADR-015: Migrate from Sessions to JWT Authentication

**Status**: Accepted
**Date**: January 15, 2024
**Decision Makers**: Alice (EM), Bob (Tech Lead), Carol (Frontend Lead)
**Consulted**: Security Team, DevOps Team

---

## Context

Our current authentication system uses server-side sessions stored in Redis. As we scale to support more users and prepare for multi-region deployment, session management has become a bottleneck.

### Current State

**Session-Based Authentication**:
```javascript
// Login creates server-side session
app.post('/login', (req, res) => {
  const user = authenticate(req.body);
  req.session.userId = user.id;  // Stored in Redis
  res.json({ success: true });
});

// Each request validates session
app.use((req, res, next) => {
  if (req.session.userId) {
    req.user = await getUser(req.session.userId);
  }
  next();
});

Problems:

Scalability: Every request requires Redis lookup (adds 5-10ms latency)
Complexity: Session replication across regions is complex
Memory: 50,000 active sessions = 250MB Redis memory
Stateful: Cannot easily add new servers (sticky sessions required)

Requirements

Stateless: No server-side session storage
Scalable: Support 50k+ concurrent users
Secure: Resistant to common attacks (XSS, CSRF, token theft)
Fast: Minimal performance impact (<1ms overhead)
Compatible: Work with existing mobile apps

Decision

We will migrate from session-based authentication to JSON Web Tokens (JWT).

Implementation

JWT-Based Authentication:

// Login generates JWT
app.post('/login', (req, res) => {
  const user = authenticate(req.body);

  const accessToken = jwt.sign(
    { userId: user.id, role: user.role },
    process.env.JWT_SECRET,
    { expiresIn: '15m' }
  );

  const refreshToken = jwt.sign(
    { userId: user.id },
    process.env.REFRESH_SECRET,
    { expiresIn: '7d' }
  );

  res.json({ accessToken, refreshToken });
});

// Each request validates JWT (no database lookup)
app.use((req, res, next) => {
  const token = req.headers.authorization?.split(' ')[1];

  try {
    req.user = jwt.verify(token, process.env.JWT_SECRET);
    next();
  } catch (error) {
    res.status(401).json({ error: 'Invalid token' });
  }
});

Token Structure

Access Token (short-lived):

Payload: { userId, role, permissions }
Expiration: 15 minutes
Signature: HMAC SHA256

Refresh Token (long-lived):

Payload: { userId }
Expiration: 7 days
Stored hash in database (for revocation)

Alternatives Considered

Alternative 1: Keep Session-Based Auth

Pros:

No migration needed
Familiar to team
Easy to revoke access (delete session)

Cons:

Scalability issues persist
Complex multi-region setup
Requires sticky sessions (load balancer complexity)

Decision: Rejected due to scalability concerns

Alternative 2: OAuth 2.0 Only

Pros:

Industry standard
Delegation capabilities
Well-tested security

Cons:

Overkill for our use case
Complex implementation
Requires authorization server
Users expect username/password

Decision: Rejected - too complex for current needs. Will add OAuth as option later.

Alternative 3: API Keys

Pros:

Simple implementation
Stateless
Easy to revoke

Cons:

No expiration (security risk)
Not suitable for user authentication
No claims/scopes

Decision: Rejected - better suited for programmatic access, not user auth

Consequences

Positive

Performance: Eliminate Redis lookup on every request
- Estimated improvement: 5-10ms per request
- Reduces Redis load by 80%
Scalability: Stateless servers
- No sticky sessions needed
- Easy horizontal scaling
- Multi-region deployment simplified
Mobile Support: Better mobile app experience
- Tokens stored locally
- No cookies required
- Offline token validation
Developer Experience: Simpler architecture
- No session middleware
- Easier testing (no session state)
- Clear token lifecycle

Negative

Token Revocation: Cannot immediately revoke access
- Mitigation: Short token expiration (15 min)
- Mitigation: Refresh token blacklist
- Mitigation: Emergency: force re-auth for all users
Token Size: JWTs larger than session IDs
- Session ID: ~32 bytes
- JWT: ~200 bytes
- Impact: Minimal (200 bytes per request is acceptable)
Secret Management: JWT secrets are critical
- Mitigation: Store in AWS Secrets Manager
- Mitigation: Rotate secrets quarterly
- Mitigation: Different secrets per environment
XSS Risk: Tokens accessible to JavaScript
- Mitigation: Store in httpOnly cookies (where possible)
- Mitigation: Strict Content Security Policy
- Mitigation: Short token expiration

Risks

Risk	Severity	Mitigation
JWT secret leaked	CRITICAL	Secrets Manager, rotation, monitoring
Cannot revoke compromised token	HIGH	Short expiration, refresh token blacklist
Algorithm confusion attack	MEDIUM	Explicitly specify algorithm in verification
Replay attacks	MEDIUM	Short expiration, HTTPS only

Implementation Plan

Phase 1: Preparation (Week 1-2)

Create JWT utility functions
Update authentication middleware
Add refresh token endpoint
Write migration guide for frontend team
Set up secrets in AWS Secrets Manager

Phase 2: Backend Migration (Week 3-4)

Deploy JWT endpoints alongside session endpoints
Add feature flag for JWT authentication
Comprehensive testing (unit + integration)
Load testing with JWTs
Security review

Phase 3: Frontend Migration (Week 5-6)

Update web app to use JWT
Update mobile apps to use JWT
Gradual rollout (10% → 50% → 100%)
Monitor error rates and performance

Phase 4: Cleanup (Week 7-8)

Remove session-based auth code
Remove Redis session storage
Update documentation
Postmortem and lessons learned

Rollback Plan

If critical issues arise:

Disable JWT feature flag
Route all traffic to session endpoints
Keep JWT code for investigation
Identify and fix issues
Resume migration

Metrics for Success

Performance:

Average request latency reduced by 5ms
p95 latency reduced by 10ms
Redis CPU usage reduced by 80%

Reliability:

No increase in authentication errors
<0.1% token validation failures
Zero security incidents

User Experience:

Login flow unchanged (transparent migration)
No increase in support tickets
Mobile app performance improved

Security Considerations

Token Security:

Tokens signed with HS256 (HMAC SHA256)
Secrets: 256-bit randomly generated
Secrets rotated quarterly
Algorithm specified in verification (prevent algorithm confusion)

Storage:

Web: httpOnly cookies (prevents XSS)
Mobile: Secure storage (Keychain/Keystore)
Never in localStorage (XSS vulnerable)

Transmission:

HTTPS only (TLS 1.3)
Secure, SameSite=Strict cookies
No tokens in URLs (log exposure)

Validation:

Verify signature
Check expiration
Validate issuer and audience
Check token not blacklisted (refresh tokens)

References

Updates

Date	Update	Author
2024-01-15	Initial ADR created	Bob
2024-01-20	Added security review feedback	Frank
2024-02-01	Updated after implementation	Bob

Status: Accepted Supersedes: ADR-008 (Session-based Authentication) Related: ADR-012 (API Security), ADR-014 (Multi-region Deployment)


## Usage Examples

@architecture-documenter @architecture-documenter --type system-overview @architecture-documenter --type adr @architecture-documenter --focus security @architecture-documenter --focus scalability @architecture-documenter --include-diagrams @architecture-documenter --update-existing


## Best Practices

### Document Architecture Decisions

**When to create an ADR**:
- Significant technical decisions
- Architecture changes
- Technology choices
- Process changes
- Security decisions

**ADR Structure**:
1. **Context**: What's the situation?
2. **Decision**: What did we decide?
3. **Alternatives**: What else did we consider?
4. **Consequences**: What are the impacts?

### Use Visual Diagrams

**Diagram Types**:
- **System Context**: Show system boundaries
- **Container**: Show high-level architecture
- **Component**: Show internal structure
- **Code**: Show class/module relationships
- **Deployment**: Show infrastructure
- **Sequence**: Show interactions over time

**Tools**:
- Diagrams as code: Mermaid, PlantUML
- Visual tools: Lucidchart, Draw.io
- Cloud-specific: AWS Architecture Diagrams

### Keep Documentation Current

**Documentation Lifecycle**:
- Create during design phase
- Review in code review
- Update with implementation changes
- Quarterly architecture review
- Archive outdated docs (don't delete)

**Version Control**:
- Store docs with code
- Version alongside releases
- Link docs to specific code versions
- Maintain changelog

### Make It Discoverable

**Organization**:
- Central location (wiki, docs folder)
- Clear naming conventions
- Table of contents
- Cross-references
- Search-friendly

**Accessibility**:
- Public within organization
- Easy to navigate
- Multiple entry points
- Links from README

## Notes

- Architecture documentation is for communication, not perfection
- Diagrams speak louder than words - use them liberally
- ADRs capture decisions and context for future reference
- Keep docs synchronized with code changes
- Version architecture docs alongside code
- Regular reviews prevent documentation drift
- Good architecture docs reduce onboarding time significantly
- Document the "why" not just the "what"
- Include trade-offs and alternatives considered
- Make security and scalability explicit
- Link architecture to business goals
- Use consistent notation and terminology

Quick Install

/plugin add https://github.com/CuriousLearner/devkit/tree/main/architecture-documenter

Copy and paste this command in Claude Code to install this skill

GitHub 仓库

CuriousLearner/devkit

Path: skills/architecture-documenter

Related Skills

langchain

Algorithmic Art Generation

webapp-testing

Testing

This Claude Skill provides a Playwright-based toolkit for testing local web applications through Python scripts. It enables frontend verification, UI debugging, screenshot capture, and log viewing while managing server lifecycles. Use it for browser automation tasks but run scripts directly rather than reading their source code to avoid context pollution.

View skill

requesting-code-review

Design

This skill dispatches a code-reviewer subagent to analyze code changes against requirements before proceeding. It should be used after completing tasks, implementing major features, or before merging to main. The review helps catch issues early by comparing the current implementation with the original plan.

View skill