deploy-edge-ai-model
Acerca de
Esta habilidad despliega modelos de ML en dispositivos de borde utilizando frameworks como TensorFlow Lite y ONNX Runtime, optimizándolos mediante cuantización y delegados de hardware. Permite inferencia en el dispositivo para escenarios donde la conectividad a la nube está limitada por latencia, costo o fiabilidad. Úsela para implementar en sistemas móviles, IoT o embebidos, con herramientas para evaluación de rendimiento y despliegue específico de plataforma.
Instalación rápida
Claude Code
Recomendadonpx skills add pjt222/agent-almanac -a claude-code/plugin add https://github.com/pjt222/agent-almanacgit clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/deploy-edge-ai-modelCopia y pega este comando en Claude Code para instalar esta habilidad
Documentación
Deploy Edge AI Model
See Extended Examples for complete configuration files, quantization scripts, and benchmark templates.
ML → edge devices. Optimized inference, HW accel, on-device mgmt.
Use When
- LLMs (Gemma 4, Phi, Llama) → mobile via Google AI Edge Gallery
- Convert → TFLite/ONNX for on-device
- Quantize → INT8/INT4, less mem + faster
- Android/iOS apps w/ local AI
- HW delegate select (GPU, NPU, DSP, Hexagon, CoreML)
- Bench latency + mem on target
- MediaPipe tasks → mobile/embedded
In
- Required: Trained model (SavedModel, PyTorch, ONNX, HF checkpoint)
- Required: Target platform (Android, iOS, Linux embedded, browser)
- Required: Device constraints (RAM, storage, compute)
- Optional: Calibration dataset → post-training quant
- Optional: AI Edge Gallery config → LLM deploy
- Optional: HW delegate prefs
Do
Step 1: Eval model → edge
Size, latency, device cap.
# assess_model.py
import os
import tensorflow as tf
def assess_model_for_edge(saved_model_path, target_ram_mb=4096):
"""Evaluate whether a model is suitable for edge deployment."""
model = tf.saved_model.load(saved_model_path)
# Check model size on disk
model_size_mb = sum(
os.path.getsize(os.path.join(dp, f))
for dp, _, filenames in os.walk(saved_model_path)
for f in filenames
) / (1024 * 1024)
print(f"Model size: {model_size_mb:.1f} MB")
print(f"Target RAM: {target_ram_mb} MB")
print(f"Size/RAM ratio: {model_size_mb / target_ram_mb:.2%}")
if model_size_mb > target_ram_mb * 0.25:
print("WARNING: Model exceeds 25% of device RAM - quantization recommended")
return False
return True
Decision matrix:
| Model Size | Device RAM | Recommended Action |
|---|---|---|
| < 50 MB | 2+ GB | Direct TFLite conversion |
| 50-500 MB | 4+ GB | INT8 quantization + TFLite |
| 500 MB-2 GB | 6+ GB | INT4 quantization + AI Edge Gallery |
| 2-4 GB | 8+ GB | Gemma 4 via AI Edge Gallery with INT4 |
| > 4 GB | 12+ GB | Weight streaming or cloud-edge hybrid |
→ Assessment done, size/RAM ratios, quant recommendation by constraints.
If err: SavedModel path valid (ls saved_model/), TF installed (python -c "import tensorflow"), disk space OK, format supported.
Step 2: LLMs via Google AI Edge Gallery
Gemma 4 + LLMs → Android.
# Clone AI Edge Gallery
git clone https://github.com/nickoala/ai-edge-gallery.git
cd ai-edge-gallery
# Build the Android app
./gradlew assembleDebug
# Install on connected device
adb install -r app/build/outputs/apk/debug/app-debug.apk
Gemma 4 config:
{
"models": [
{
"name": "Gemma 4 2B IT",
"url": "https://huggingface.co/google/gemma-4-2b-it-gpu-int4",
"format": "tflite",
"backend": "gpu",
"config": {
"max_tokens": 1024,
"temperature": 0.7,
"top_k": 40,
"top_p": 0.95
}
},
{
"name": "Gemma 4 4B IT",
"url": "https://huggingface.co/google/gemma-4-4b-it-gpu-int4",
"format": "tflite",
"backend": "gpu",
"config": {
"max_tokens": 2048,
"temperature": 0.7
}
}
]
}
Programmatic inference w/ LLM Inference API:
# gemma_edge_inference.py
from mediapipe.tasks.genai import llm_inference
# Configure the LLM
options = llm_inference.LlmInferenceOptions(
model_path="/data/local/tmp/gemma-4-2b-it-int4.tflite",
max_tokens=512,
temperature=0.7,
top_k=40,
supported_lora_ranks=[4, 8, 16] # Optional LoRA support
)
# Create inference engine
engine = llm_inference.LlmInference(options=options)
# Run inference
response = engine.generate_response("Explain edge computing in one sentence.")
print(response)
# Streaming inference
for chunk in engine.generate_response_async("List three benefits of on-device AI."):
print(chunk, end="", flush=True)
→ App builds+installs, Gemma 4 downloads, coherent responses, GPU delegate active.
If err: SDK ≥ 26 (adb shell getprop ro.build.version.sdk), device storage OK, GPU delegate supported (adb logcat | grep -i delegate), HF access, ADB connection (adb devices).
Step 3: Convert + quantize w/ TFLite
Standard → TFLite w/ post-training quant.
# convert_tflite.py
import os
import tensorflow as tf
import numpy as np
def convert_to_tflite(saved_model_path, output_path, quantization="dynamic"):
"""Convert SavedModel to TFLite with quantization."""
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
if quantization == "dynamic":
converter.optimizations = [tf.lite.Optimize.DEFAULT]
elif quantization == "int8":
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# Representative dataset for calibration
def representative_dataset():
for _ in range(100):
yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]
converter.representative_dataset = representative_dataset
elif quantization == "float16":
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()
with open(output_path, "wb") as f:
f.write(tflite_model)
original_size = sum(
os.path.getsize(os.path.join(dp, f))
for dp, _, filenames in os.walk(saved_model_path)
for f in filenames
) / (1024 * 1024)
quantized_size = len(tflite_model) / (1024 * 1024)
print(f"Original: {original_size:.1f} MB -> Quantized: {quantized_size:.1f} MB")
print(f"Compression ratio: {original_size / quantized_size:.1f}x")
# Usage
convert_to_tflite("saved_model/", "model_int8.tflite", quantization="int8")
ONNX Runtime quant alt:
# quantize_onnx.py
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType
# Dynamic quantization (no calibration data needed)
quantize_dynamic(
model_input="model.onnx",
model_output="model_int8.onnx",
weight_type=QuantType.QInt8
)
# Static quantization (better accuracy, needs calibration)
# ... (see EXAMPLES.md for complete calibration workflow)
→ TFLite gen'd, size -2-4x w/ INT8, accuracy within 1-2%, ONNX quant valid.
If err: TF ≥ 2.15, rep dataset matches input shape, all ops supported (converter.allow_custom_ops = True fallback), ONNX opset compat.
Step 4: HW delegates
Select + config.
# configure_delegates.py
import tensorflow as tf
def create_interpreter_with_delegate(model_path, delegate="gpu"):
"""Create TFLite interpreter with hardware delegate."""
if delegate == "gpu":
delegate_obj = tf.lite.experimental.load_delegate(
"libtensorflowlite_gpu_delegate.so",
options={"precision": "fp16", "allow_quantized_models": "true"}
)
elif delegate == "nnapi":
# Android Neural Networks API - routes to NPU/DSP
delegate_obj = tf.lite.experimental.load_delegate(
"libtensorflowlite_nnapi_delegate.so"
)
elif delegate == "xnnpack":
# Optimized CPU inference
delegate_obj = None # XNNPACK is default in TFLite
interpreter = tf.lite.Interpreter(
model_path=model_path,
experimental_delegates=[delegate_obj] if delegate_obj else None,
num_threads=4
)
interpreter.allocate_tensors()
return interpreter
Delegate guide:
| Device | Best Delegate | Fallback | Notes |
|---|---|---|---|
| Android (Qualcomm) | NNAPI -> Hexagon DSP | GPU -> XNNPACK | Check nnapi_accelerator_name |
| Android (MediaTek) | NNAPI -> APU | GPU -> XNNPACK | Dimensity chips have dedicated APU |
| Android (Samsung) | NNAPI -> NPU | GPU -> XNNPACK | Exynos NPU via NNAPI |
| iOS | CoreML delegate | Metal GPU | Use coreml_delegate for ANE |
| Linux embedded | GPU (if available) | XNNPACK | RPi uses XNNPACK CPU |
| Browser | WebGL / WebGPU | WASM SIMD | Via TensorFlow.js |
→ Delegate loads, inference on accel, latency 2-10x vs CPU-only.
If err: Lib on device, delegate supported (adb shell cat /proc/cpuinfo), fall back XNNPACK, OpenCL for GPU, NNAPI ver.
Step 5: Bench on-device
Latency, mem, power.
# Use TFLite benchmark tool
adb push model_int8.tflite /data/local/tmp/
# CPU benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--num_threads=4 \
--num_runs=50 \
--warmup_runs=5
# GPU benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--use_gpu=true \
--num_runs=50
# NNAPI benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--use_nnapi=true \
--nnapi_accelerator_name=google-edgetpu \
--num_runs=50
Python bench:
# benchmark_edge.py
import time
import numpy as np
import psutil
def benchmark_inference(interpreter, input_data, num_runs=100):
"""Benchmark TFLite model inference."""
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Warmup
for _ in range(10):
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
# Benchmark
latencies = []
mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
for _ in range(num_runs):
start = time.perf_counter()
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()
latencies.append((time.perf_counter() - start) * 1000)
mem_after = psutil.Process().memory_info().rss / (1024 * 1024)
print(f"Latency (p50): {np.percentile(latencies, 50):.1f} ms")
print(f"Latency (p95): {np.percentile(latencies, 95):.1f} ms")
print(f"Latency (p99): {np.percentile(latencies, 99):.1f} ms")
print(f"Memory delta: {mem_after - mem_before:.1f} MB")
print(f"Throughput: {1000 / np.mean(latencies):.1f} inferences/sec")
→ Latency percentiles + mem + throughput. GPU 2-5x vs CPU. Gemma 4 2B → 10-30 tok/sec flagship.
If err: Bench binary matches arch (arm64-v8a), model pushed (adb shell ls /data/local/tmp/), storage OK, kill bg apps, thermal throttle check (adb shell cat /sys/class/thermal/thermal_zone*/temp).
Step 6: Package → prod
Mobile app w/ embedded/downloadable model.
// Android: EdgeAIManager.kt
import com.google.mediapipe.tasks.genai.llminference.LlmInference
class EdgeAIManager(private val context: Context) {
private var llmInference: LlmInference? = null
fun initialize(modelPath: String) {
val options = LlmInference.LlmInferenceOptions.builder()
.setModelPath(modelPath)
.setMaxTokens(512)
.setTemperature(0.7f)
.setTopK(40)
.setResultListener { result, done ->
// Handle streaming tokens
onTokenReceived(result, done)
}
.build()
llmInference = LlmInference.createFromOptions(context, options)
}
fun generateResponse(prompt: String): String {
return llmInference?.generateResponse(prompt)
?: throw IllegalStateException("Model not initialized")
}
fun release() {
llmInference?.close()
llmInference = null
}
}
Download + cache:
// ModelDownloader.kt
class ModelDownloader(private val context: Context) {
private val modelDir = File(context.filesDir, "models")
suspend fun ensureModel(modelName: String, url: String): File {
val modelFile = File(modelDir, modelName)
if (modelFile.exists()) return modelFile
modelDir.mkdirs()
// Download with progress tracking
// ... (see EXAMPLES.md for complete implementation)
return modelFile
}
}
→ App builds w/ MediaPipe, model loads first launch, latency OK, cached after download, fallback on unsupported.
If err: minSdk ≥ 26, MediaPipe dep ver, model SHA256, storage, ProGuard preserves MediaPipe classes, test multi-device.
Check
- Model → TFLite/ONNX w/o op errs
- Quant accuracy < 2% degrade
- HW delegate loads + accels
- Latency meets target (< 100ms vision, < 50ms/tok LLM)
- Mem within budget
- AI Edge Gallery runs Gemma 4
- On-device LLM coherent
- App handles download/cache/update
- Graceful degrade on unsupported
- Battery acceptable
Traps
- Unsupported TFLite ops: Custom ops fail →
converter.allow_custom_ops = Trueor replace, check compat list - Quant accuracy loss: INT4 degrades sensitive → mixed precision, calibrate w/ rep data
- Delegate init fail: GPU crashes old devices → CPU fallback, check compat
- Mem pressure: Model + app > RAM → memory-mapped, unload, batch=1
- Thermal throttle: Sustained inference → overheat → duty cycle, reduce freq, monitor zones
- Download size: Large over cellular → Wi-Fi-only, resumable, progressive
- Version fragmentation: Works some not others → device matrix test, NNAPI ver checks, compat DB
→
deploy-ml-model-serving— cloud serving (complement to edge)monitor-model-drift— quality over timeregister-ml-model— register before edge deploycreate-dockerfile— containerize conversion pipelinecreate-multistage-dockerfile— multi-stage builds
Repositorio GitHub
Habilidades relacionadas
qmd
Desarrolloqmd es una herramienta CLI de búsqueda e indexación local que permite a los desarrolladores indexar y buscar en archivos locales mediante búsqueda híbrida que combina BM25, embeddings vectoriales y reranking. Es compatible tanto con uso desde la línea de comandos como con modo MCP (Model Context Protocol) para integración con Claude. La herramienta utiliza Ollama para los embeddings y almacena los índices localmente, lo que la hace ideal para buscar documentación o bases de código directamente desde la terminal.
subagent-driven-development
DesarrolloEsta habilidad ejecuta planes de implementación asignando un nuevo subagente para cada tarea independiente, con revisión de código entre tareas. Permite una iteración rápida mientras mantiene controles de calidad a través de este proceso de revisión. Úsala cuando trabajes en tareas mayormente independientes dentro de la misma sesión para garantizar un progreso continuo con verificaciones de calidad integradas.
mcporter
DesarrolloLa habilidad mcporter permite a los desarrolladores gestionar y llamar servidores del Protocolo de Contexto de Modelo (MCP) directamente desde Claude. Proporciona comandos para listar servidores disponibles, llamar a sus herramientas con argumentos, y manejar la autenticación y el ciclo de vida del daemon. Utiliza esta habilidad para integrar y probar la funcionalidad de servidores MCP en tu flujo de trabajo de desarrollo.
adk-deployment-specialist
DesarrolloEsta habilidad despliega y orquesta agentes Vertex AI ADK utilizando el protocolo A2A, gestionando el descubrimiento de AgentCard, el envío de tareas y soportando herramientas como el Sandbox de Ejecución de Código y el Banco de Memoria. Permite construir sistemas multiagente con patrones de orquestación secuencial, paralela o en bucle en Python, Java o Go. Úsela cuando se le solicite desplegar agentes ADK u orquestar flujos de trabajo de agentes en Google Cloud.
