modal-serverless-gpu
About
Modal provides a serverless GPU cloud platform for running ML workloads without infrastructure management. It enables deploying models as auto-scaling APIs and running batch jobs with pay-per-second pricing. Key features include on-demand access to various GPUs (T4 to H100) and a Python-native interface for defining compute tasks.
Quick Install
Claude Code
Recommended: /plugin add https://github.com/davila7/claude-code-templates
Manual: git clone https://github.com/davila7/claude-code-templates.git ~/.claude/skills/modal-serverless-gpu
Copy and paste the recommended command into Claude Code to install this skill.
Documentation
Modal Serverless GPU
Comprehensive guide to running ML workloads on Modal's serverless GPU cloud platform.
When to use Modal
Use Modal when:
- Running GPU-intensive ML workloads without managing infrastructure
- Deploying ML models as auto-scaling APIs
- Running batch processing jobs (training, inference, data processing)
- Paying only for GPU seconds used, with no idle costs
- Prototyping ML applications quickly
- Running scheduled jobs (cron-like workloads)
Key features:
- Serverless GPUs: T4, L4, A10G, L40S, A100, H100, H200, B200 on-demand
- Python-native: Define infrastructure in Python code, no YAML
- Auto-scaling: Scale to zero, scale to 100+ GPUs instantly
- Sub-second cold starts: Rust-based infrastructure for fast container launches
- Container caching: Image layers cached for rapid iteration
- Web endpoints: Deploy functions as REST APIs with zero-downtime updates
Use alternatives instead:
- RunPod: For longer-running pods with persistent state
- Lambda Labs: For reserved GPU instances
- SkyPilot: For multi-cloud orchestration and cost optimization
- Kubernetes: For complex multi-service architectures
Quick start
Installation
pip install modal
modal setup # Opens browser for authentication
Hello World with GPU
import modal
app = modal.App("hello-gpu")
@app.function(gpu="T4")
def gpu_info():
    import subprocess
    return subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
@app.local_entrypoint()
def main():
    print(gpu_info.remote())
Run: modal run hello_gpu.py
Basic inference endpoint
import modal
app = modal.App("text-generation")
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")
@app.cls(gpu="A10G", image=image)
class TextGenerator:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-generation", model="gpt2", device=0)
    @modal.method()
    def generate(self, prompt: str) -> str:
        return self.pipe(prompt, max_length=100)[0]["generated_text"]
@app.local_entrypoint()
def main():
    print(TextGenerator().generate.remote("Hello, world"))
Core concepts
Key components
| Component | Purpose |
|---|---|
| App | Container for functions and resources |
| Function | Serverless function with compute specs |
| Cls | Class-based functions with lifecycle hooks |
| Image | Container image definition |
| Volume | Persistent storage for models/data |
| Secret | Secure credential storage |
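The sketch below ties these components together in one minimal app; the app, image, volume, and secret names are illustrative placeholders, not values from this guide.
import modal
app = modal.App("component-demo")  # App: groups functions and resources
image = modal.Image.debian_slim().pip_install("torch")  # Image: container definition
volume = modal.Volume.from_name("demo-cache", create_if_missing=True)  # Volume: persistent storage
secret = modal.Secret.from_name("my-api-keys")  # Secret: injected as environment variables
@app.function(gpu="T4", image=image, volumes={"/cache": volume}, secrets=[secret])
def work():
    # Function: a serverless unit of compute with the resources declared above
    pass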
Execution modes
| Command | Description |
|---|---|
| modal run script.py | Execute and exit |
| modal serve script.py | Development with live reload |
| modal deploy script.py | Persistent cloud deployment |
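After modal deploy, a deployed function can be called from any Python process. A hedged sketch, assuming the hello-gpu app from the Quick start has been deployed:
import modal
# Look up the deployed function by app name and function name
gpu_info = modal.Function.from_name("hello-gpu", "gpu_info")
print(gpu_info.remote())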
GPU configuration
Available GPUs
| GPU | VRAM | Best For |
|---|---|---|
| T4 | 16GB | Budget inference, small models |
| L4 | 24GB | Inference, Ada Lovelace architecture |
| A10G | 24GB | Training/inference, 3.3x faster than T4 |
| L40S | 48GB | Recommended for inference (best cost/performance) |
| A100-40GB | 40GB | Large model training |
| A100-80GB | 80GB | Very large models |
| H100 | 80GB | Fastest, FP8 + Transformer Engine |
| H200 | 141GB | Auto-upgrade from H100, 4.8TB/s memory bandwidth |
| B200 | 192GB | Latest Blackwell architecture |
GPU specification patterns
# Single GPU
@app.function(gpu="A100")
# Specific memory variant
@app.function(gpu="A100-80GB")
# Multiple GPUs (up to 8)
@app.function(gpu="H100:4")
# GPU with fallbacks
@app.function(gpu=["H100", "A100", "L40S"])
# Any available GPU
@app.function(gpu="any")
Container images
# Basic image with pip
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.1.0", "transformers==4.36.0", "accelerate"
)
# From CUDA base
image = modal.Image.from_registry(
    "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
    add_python="3.11"
).pip_install("torch", "transformers")
# With system packages
image = modal.Image.debian_slim().apt_install("git", "ffmpeg").pip_install("whisper")
Persistent storage
volume = modal.Volume.from_name("model-cache", create_if_missing=True)
@app.function(gpu="A10G", volumes={"/models": volume})
def load_model():
    import os
    model_path = "/models/llama-7b"
    if not os.path.exists(model_path):
        model = download_model()
        model.save_pretrained(model_path)
        volume.commit()  # Persist changes
    return load_from_path(model_path)
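As a companion sketch using the same volume, other functions can call volume.reload() to pick up data committed elsewhere before reading it:
@app.function(volumes={"/models": volume})
def list_models():
    import os
    volume.reload()  # Fetch the latest committed state before reading
    return os.listdir("/models")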
Web endpoints
FastAPI endpoint decorator
@app.function()
@modal.fastapi_endpoint(method="POST")
def predict(text: str) -> dict:
    return {"result": model.predict(text)}
Full ASGI app
from fastapi import FastAPI
web_app = FastAPI()
@web_app.post("/predict")
async def predict(text: str):
    return {"result": await model.predict.remote.aio(text)}
@app.function()
@modal.asgi_app()
def fastapi_app():
    return web_app
Web endpoint types
| Decorator | Use Case |
|---|---|
| @modal.fastapi_endpoint() | Simple function → API |
| @modal.asgi_app() | Full FastAPI/Starlette apps |
| @modal.wsgi_app() | Django/Flask apps |
| @modal.web_server(port) | Arbitrary HTTP servers |
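For the WSGI path, a minimal Flask sketch (assuming Flask is added to the image; the route is illustrative):
flask_image = modal.Image.debian_slim().pip_install("flask")
@app.function(image=flask_image)
@modal.wsgi_app()
def flask_app():
    from flask import Flask
    web_app = Flask(__name__)
    @web_app.route("/health")
    def health():
        # Flask serializes the dict to JSON
        return {"status": "ok"}
    return web_app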
Dynamic batching
@app.function()
@modal.batched(max_batch_size=32, wait_ms=100)
async def batch_predict(inputs: list[str]) -> list[dict]:
    # Inputs arrive automatically batched
    return model.batch_predict(inputs)
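Callers still submit single inputs; Modal assembles concurrent calls into batches of up to max_batch_size. A usage sketch based on the batch_predict function above:
import asyncio
@app.local_entrypoint()
async def main():
    prompts = ["first prompt", "second prompt", "third prompt"]
    # Each call sends one input; batching happens server-side
    results = await asyncio.gather(*(batch_predict.remote.aio(p) for p in prompts))
    print(results)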
Secrets management
# Create secret
modal secret create huggingface HF_TOKEN=hf_xxx
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
Scheduling
@app.function(schedule=modal.Cron("0 0 * * *")) # Daily midnight
def daily_job():
    pass
@app.function(schedule=modal.Period(hours=1))
def hourly_job():
    pass
Performance optimization
Cold start mitigation
@app.function(
    container_idle_timeout=300,  # Keep warm 5 min
    allow_concurrent_inputs=10,  # Handle concurrent requests
)
def inference():
    pass
Model loading best practices
@app.cls(gpu="A100")
class Model:
    @modal.enter()  # Run once at container start
    def load(self):
        self.model = load_model()  # Load during warm-up
    @modal.method()
    def predict(self, x):
        return self.model(x)
Parallel processing
@app.function()
def process_item(item):
    return expensive_computation(item)
@app.function()
def run_parallel():
    items = list(range(1000))
    # Fan out to parallel containers
    results = list(process_item.map(items))
    return results
Common configuration
@app.function(
gpu="A100",
memory=32768, # 32GB RAM
cpu=4, # 4 CPU cores
timeout=3600, # 1 hour max
container_idle_timeout=120,# Keep warm 2 min
retries=3, # Retry on failure
concurrency_limit=10, # Max concurrent containers
)
def my_function():
pass
Debugging
# Test locally
if __name__ == "__main__":
    result = my_function.local()
# View logs
# modal app logs my-app
Common issues
| Issue | Solution |
|---|---|
| Cold start latency | Increase container_idle_timeout, use @modal.enter() |
| GPU OOM | Use larger GPU (A100-80GB), enable gradient checkpointing |
| Image build fails | Pin dependency versions, check CUDA compatibility |
| Timeout errors | Increase timeout, add checkpointing |
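For timeout and retry tuning, one possible configuration is sketched below; the values and the use of modal.Retries for exponential backoff are illustrative, not prescriptive:
@app.function(
    gpu="A100-80GB",
    timeout=7200,  # Allow up to 2 hours for long-running jobs
    retries=modal.Retries(max_retries=3, backoff_coefficient=2.0, initial_delay=5.0),
)
def long_running_job():
    pass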
References
- Advanced Usage - Multi-GPU, distributed training, cost optimization
- Troubleshooting - Common issues and solutions
Resources
- Documentation: https://modal.com/docs
- Examples: https://github.com/modal-labs/modal-examples
- Pricing: https://modal.com/pricing
- Discord: https://discord.gg/modal
Related Skills
railway-deployment
This Claude Skill manages Railway deployments for lifecycle operations and troubleshooting. It enables developers to view logs, redeploy, restart, or remove deployments through Railway CLI commands. Use it for deployment visibility and debugging, but note that deleting services requires the railway-environment skill instead.
railway-database
This skill adds official Railway database services (Postgres, Redis, MySQL, MongoDB) with pre-configured volumes and connection variables. Use it when developers request to add, connect, or wire up databases in their Railway projects. It specifically handles database services while directing other templates to the separate railway-templates skill.
railway-status
This skill checks the current deployment status and uptime of Railway projects in the current directory. It's triggered by queries like "railway status," "what's deployed," or questions about deployment status and uptime. Use the separate railway-environment skill for configuration or variable queries instead.
railway-templates
This skill enables developers to search for and deploy pre-configured services from Railway's template marketplace, such as Ghost, Strapi, and n8n. Use it when you need to quickly add a templated service or find templates for a specific use case like CMS or monitoring. For core databases, the separate railway-database skill is preferred.
