deploy-edge-ai-model
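About
This skill lets you deploy machine learning models to edge devices such as mobile phones and IoT hardware, using frameworks such as TensorFlow Lite and ONNX Runtime. It covers the key steps: model quantization, hardware delegate selection, and performance benchmarking for constrained environments. Use it when latency, cost, or connectivity requirements make cloud inference unsuitable.
Quick Install
Claude Code
Recommended: npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/deploy-edge-ai-model
Copy and paste this command into Claude Code to install the skill.
Documentation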
Deploy Edge AI Model
See Extended Examples for complete configuration files, quantization scripts, and benchmark templates.
Deploy ML models to edge devices with optimized inference, hardware acceleration, and on-device model management.
When to Use
- Deploy LLMs (Gemma 4, Phi, Llama) to mobile devices via Google AI Edge Gallery
- Convert models to TensorFlow Lite or ONNX for on-device inference
- Quantize models to INT8/INT4 for reduced memory and faster inference
- Build Android/iOS apps with local AI capabilities
- Pick hardware delegates (GPU, NPU, DSP, Hexagon, CoreML)
- Benchmark inference latency and memory on target devices
- Deploy MediaPipe tasks (vision, text, audio) to mobile or embedded platforms
Inputs
- Required: Trained model (SavedModel, PyTorch, ONNX, or Hugging Face checkpoint)
- Required: Target platform (Android, iOS, Linux embedded, browser)
- Required: Target device constraints (RAM, storage, compute capability)
- Optional: Calibration dataset for post-training quantization
- Optional: Google AI Edge Gallery configuration for LLM deployment
- Optional: Hardware delegate preferences (GPU, NPU, CPU-only)
Steps
Step 1: Evaluate Model for Edge Deployment
Assess model size, latency requirements, and target device capabilities.
# assess_model.py
import os
import tensorflow as tf

def assess_model_for_edge(saved_model_path, target_ram_mb=4096):
    """Evaluate whether a model is suitable for edge deployment."""
    # Loading confirms the SavedModel is readable; the loaded object is not used below
    model = tf.saved_model.load(saved_model_path)
    # Check model size on disk
    model_size_mb = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)
    print(f"Model size: {model_size_mb:.1f} MB")
    print(f"Target RAM: {target_ram_mb} MB")
    print(f"Size/RAM ratio: {model_size_mb / target_ram_mb:.2%}")
    # Returns False when quantization is recommended before deployment
    if model_size_mb > target_ram_mb * 0.25:
        print("WARNING: Model exceeds 25% of device RAM - quantization recommended")
        return False
    return True
Edge deployment decision matrix:
| Model Size | Device RAM | Recommended Action |
|---|---|---|
| < 50 MB | 2+ GB | Direct TFLite conversion |
| 50-500 MB | 4+ GB | INT8 quantization + TFLite |
| 500 MB-2 GB | 6+ GB | INT4 quantization + AI Edge Gallery |
| 2-4 GB | 8+ GB | Gemma 4 via AI Edge Gallery with INT4 |
| > 4 GB | 12+ GB | Weight streaming or cloud-edge hybrid |
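The decision matrix above can be expressed as a small lookup. This is a minimal sketch: the function name `recommend_action` is hypothetical, the handling of exact boundary values (for example a model of exactly 4 GB) is an assumption, and returning `None` when the device has less RAM than the row lists is a choice made here because the matrix does not cover that case.

```python
import math

# (upper size bound in MB, minimum device RAM in GB, action), copied from the matrix.
# Boundary handling is an assumption: each row covers sizes strictly below its bound.
MATRIX = [
    (50, 2, "Direct TFLite conversion"),
    (500, 4, "INT8 quantization + TFLite"),
    (2048, 6, "INT4 quantization + AI Edge Gallery"),
    (4096, 8, "Gemma 4 via AI Edge Gallery with INT4"),
    (math.inf, 12, "Weight streaming or cloud-edge hybrid"),
]


def recommend_action(model_size_mb, device_ram_gb):
    """Return the matrix action for a model, or None if the device RAM is below the row's minimum."""
    for upper_mb, min_ram_gb, action in MATRIX:
        if model_size_mb < upper_mb:
            return action if device_ram_gb >= min_ram_gb else None
    return None


if __name__ == "__main__":
    print(recommend_action(300, 4))
    print(recommend_action(300, 2))
```

A `None` result signals that the model and device combination falls outside the matrix, which usually means a smaller model or stronger quantization is worth evaluating first.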
Got: Model assessment completes. Size and RAM ratios calculated. Quantization recommendation generated based on device constraints.
If fail: Verify SavedModel path is valid (ls saved_model/), check TensorFlow installation (python -c "import tensorflow"), ensure sufficient disk space for model loading, verify model format supported.
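Before converting anything, it can help to estimate how large the weights will be at each precision. The sketch below is plain arithmetic (parameters times bits per weight); the helper name `estimate_weight_size_mb` is hypothetical, the 2-billion parameter count is an illustrative round number, and real files are somewhat larger because of metadata and layers kept at higher precision.

```python
def estimate_weight_size_mb(num_params, bits_per_weight):
    """Approximate weight storage in MiB; ignores metadata and runtime buffers."""
    return num_params * bits_per_weight / 8 / (1024 * 1024)


if __name__ == "__main__":
    params = 2_000_000_000  # illustrative parameter count, not a measured value
    for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("int4", 4)]:
        print(f"{label}: {estimate_weight_size_mb(params, bits):.0f} MiB")
```

Comparing the estimate with the device RAM gives an early indication of which quantization level is needed.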
Step 2: Deploy LLMs via Google AI Edge Gallery
Use Google AI Edge Gallery to deploy Gemma 4 and other LLMs to Android devices.
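# Clone AI Edge Gallery (official repository under the google-ai-edge organization)
git clone https://github.com/google-ai-edge/gallery.git
cd gallery
# Build the Android app (run from the directory that contains gradlew; it may be a subdirectory of the repository)
./gradlew assembleDebug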
# Install on connected device
adb install -r app/build/outputs/apk/debug/app-debug.apk
Configure Gemma 4 model for AI Edge Gallery:
{
"models": [
{
"name": "Gemma 4 2B IT",
"url": "https://huggingface.co/google/gemma-4-2b-it-gpu-int4",
"format": "tflite",
"backend": "gpu",
"config": {
"max_tokens": 1024,
"temperature": 0.7,
"top_k": 40,
"top_p": 0.95
}
},
{
"name": "Gemma 4 4B IT",
"url": "https://huggingface.co/google/gemma-4-4b-it-gpu-int4",
"format": "tflite",
"backend": "gpu",
"config": {
"max_tokens": 2048,
"temperature": 0.7
}
}
]
}
Programmatic on-device inference with LLM Inference API:
# gemma_edge_inference.py
# Note: MediaPipe documents the LLM Inference API for Android, iOS, and Web.
# The Python module path and method names below are not verified against a
# released package; read this block as pseudocode for the call sequence.
from mediapipe.tasks.genai import llm_inference

# Configure the LLM
options = llm_inference.LlmInferenceOptions(
    model_path="/data/local/tmp/gemma-4-2b-it-int4.tflite",
    max_tokens=512,
    temperature=0.7,
    top_k=40,
    supported_lora_ranks=[4, 8, 16]  # Optional LoRA support
)

# Create inference engine
engine = llm_inference.LlmInference(options=options)

# Run inference
response = engine.generate_response("Explain edge computing in one sentence.")
print(response)

# Streaming inference
for chunk in engine.generate_response_async("List three benefits of on-device AI."):
    print(chunk, end="", flush=True)
Got: AI Edge Gallery app builds and installs. Gemma 4 model downloads to device. On-device inference produces coherent responses. GPU delegate activates for acceleration.
If fail: Check Android SDK version >= 26 (adb shell getprop ro.build.version.sdk), verify device has sufficient storage for model download, ensure GPU delegate supported (adb logcat | grep -i delegate), check Hugging Face model access permissions, verify ADB connection (adb devices).
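A quick sanity check of the model configuration before pushing it to a device can catch typos early. The sketch below assumes the JSON layout shown in this step; the function name `validate_models_config` and the specific rules (required keys, positive `max_tokens`, `top_p` in (0, 1]) are assumptions made for illustration, not a schema published by AI Edge Gallery.

```python
import json

# Keys every model entry is assumed to need, based on the example config in this step.
REQUIRED_KEYS = ("name", "url", "format", "backend")


def validate_models_config(text):
    """Return a list of human-readable problems; an empty list means no problems were found."""
    problems = []
    config = json.loads(text)
    for i, model in enumerate(config.get("models", [])):
        label = model.get("name", f"models[{i}]")
        for key in REQUIRED_KEYS:
            if not model.get(key):
                problems.append(f"{label}: missing '{key}'")
        params = model.get("config", {})
        if params.get("max_tokens", 1) <= 0:
            problems.append(f"{label}: max_tokens must be positive")
        if params.get("temperature", 0.0) < 0:
            problems.append(f"{label}: temperature must be non-negative")
        if not 0 < params.get("top_p", 1.0) <= 1:
            problems.append(f"{label}: top_p must be in (0, 1]")
    return problems


if __name__ == "__main__":
    sample = '{"models": [{"name": "Demo", "format": "tflite", "backend": "gpu"}]}'
    for problem in validate_models_config(sample):
        print(problem)
```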
Step 3: Convert and Quantize Models with TFLite
Convert standard models to TFLite format with post-training quantization.
# convert_tflite.py
import os
import tensorflow as tf
import numpy as np

def convert_to_tflite(saved_model_path, output_path, quantization="dynamic"):
    """Convert SavedModel to TFLite with quantization."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
    if quantization == "dynamic":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    elif quantization == "int8":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS_INT8
        ]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8

        # Representative dataset for calibration.
        # Placeholder: replace the random tensors with real samples that match
        # the model's input shape, otherwise the calibrated ranges will be poor.
        def representative_dataset():
            for _ in range(100):
                yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]

        converter.representative_dataset = representative_dataset
    elif quantization == "float16":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_types = [tf.float16]

    tflite_model = converter.convert()
    with open(output_path, "wb") as f:
        f.write(tflite_model)

    # Compare the on-disk SavedModel size with the converted flatbuffer
    original_size = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)
    quantized_size = len(tflite_model) / (1024 * 1024)
    print(f"Original: {original_size:.1f} MB -> Quantized: {quantized_size:.1f} MB")
    print(f"Compression ratio: {original_size / quantized_size:.1f}x")

# Usage
convert_to_tflite("saved_model/", "model_int8.tflite", quantization="int8")
ONNX Runtime quantization alternative:
# quantize_onnx.py
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType
# Dynamic quantization (no calibration data needed)
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8
)
# Static quantization (better accuracy, needs calibration)
# ... (see EXAMPLES.md for complete calibration workflow)
Got: TFLite model generated at specified path. Model size reduced by 2-4x with INT8. Inference accuracy within 1-2% of original. ONNX quantization produces valid model.
If fail: Check TensorFlow version >= 2.15 for latest quantization support, verify representative dataset matches model input shape, ensure all ops supported in TFLite (converter.allow_custom_ops = True as fallback), check ONNX opset version compatibility.
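To see where INT8 accuracy loss comes from, it helps to run the underlying arithmetic by hand. The sketch below shows generic affine (scale and zero-point) quantization in pure Python; it is an illustration of the idea, not the exact per-channel scheme that TFLite or ONNX Runtime applies, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to int8 with one shared scale and zero point (affine quantization)."""
    lo = min(min(values), 0.0)  # the representable range must include zero
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(-128 - lo / scale)
    quantized = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from int8 values."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    q, scale, zp = quantize_int8(weights)
    restored = dequantize(q, scale, zp)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.5f} zero_point={zp} worst error={worst:.5f}")
```

Each weight is stored in one byte instead of four, and the round-trip error stays within about one quantization step. A wider value range means a larger step, which is why calibration data that reflects real inputs matters.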
Step 4: Configure Hardware Delegates
Pick and configure hardware acceleration delegates for target devices.
# configure_delegates.py
import tensorflow as tf

def create_interpreter_with_delegate(model_path, delegate="gpu"):
    """Create TFLite interpreter with hardware delegate."""
    if delegate == "gpu":
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_gpu_delegate.so",
            options={"precision": "fp16", "allow_quantized_models": "true"}
        )
    elif delegate == "nnapi":
        # Android Neural Networks API - routes to NPU/DSP
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_nnapi_delegate.so"
        )
    elif delegate == "xnnpack":
        # Optimized CPU inference
        delegate_obj = None  # XNNPACK is default in TFLite
    else:
        # Without this branch an unknown name would leave delegate_obj undefined
        raise ValueError(f"Unknown delegate: {delegate}")
    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        experimental_delegates=[delegate_obj] if delegate_obj else None,
        num_threads=4
    )
    interpreter.allocate_tensors()
    return interpreter
Delegate pick guide:
| Device | Best Delegate | Fallback | Notes |
|---|---|---|---|
| Android (Qualcomm) | NNAPI -> Hexagon DSP | GPU -> XNNPACK | Check nnapi_accelerator_name |
| Android (MediaTek) | NNAPI -> APU | GPU -> XNNPACK | Dimensity chips have dedicated APU |
| Android (Samsung) | NNAPI -> NPU | GPU -> XNNPACK | Exynos NPU via NNAPI |
| iOS | CoreML delegate | Metal GPU | Use coreml_delegate for ANE |
| Linux embedded | GPU (if available) | XNNPACK | RPi uses XNNPACK CPU |
| Browser | WebGL / WebGPU | WASM SIMD | Via TensorFlow.js |
Got: Delegate loads without errors. Inference runs on target accelerator. Latency improves 2-10x over CPU-only depending on model and device.
If fail: Verify delegate library exists on device, check device supports requested delegate (adb shell cat /proc/cpuinfo for CPU features), fall back to XNNPACK if GPU/NPU unavailable, check OpenCL support for GPU delegate, verify the Android API level is high enough for NNAPI, which was introduced in Android 8.1, API level 27 (adb shell getprop ro.build.version.sdk).
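The fallback order in the delegate guide can be sketched as a loop that tries each delegate and drops to CPU when none loads. This is a minimal sketch: `first_working_delegate` is a hypothetical helper, the loaders are plain callables so the logic can be exercised without TensorFlow, and the set of exceptions caught is an assumption about how a delegate load may fail.

```python
def first_working_delegate(candidates):
    """Try (name, loader) pairs in priority order.

    Returns (name, delegate_object, errors). Falls back to ("cpu", None, errors)
    when every loader fails, so the caller can run on the default CPU path.
    """
    errors = {}
    for name, loader in candidates:
        try:
            return name, loader(), errors
        except (OSError, ValueError, RuntimeError) as exc:
            errors[name] = str(exc)
    return "cpu", None, errors


if __name__ == "__main__":
    def missing_library():
        raise OSError("library not found")

    def stub_gpu():
        return "gpu-delegate-stub"  # stands in for a real delegate object

    name, delegate, errors = first_working_delegate(
        [("nnapi", missing_library), ("gpu", stub_gpu)]
    )
    print(name, errors)
```

In practice each loader would wrap a call such as `tf.lite.experimental.load_delegate(...)`, and the collected errors are worth logging because they explain why a device ended up on CPU.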
Step 5: Benchmark On-Device Performance
Measure inference latency, memory usage, and power consumption on target devices.
# Use TFLite benchmark tool
adb push model_int8.tflite /data/local/tmp/
# CPU benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--num_threads=4 \
--num_runs=50 \
--warmup_runs=5
# GPU benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--use_gpu=true \
--num_runs=50
# NNAPI benchmark
adb shell /data/local/tmp/benchmark_model \
--graph=/data/local/tmp/model_int8.tflite \
--use_nnapi=true \
--nnapi_accelerator_name=google-edgetpu \
--num_runs=50
Python benchmarking:
# benchmark_edge.py
import time
import numpy as np
import psutil

def benchmark_inference(interpreter, input_data, num_runs=100):
    """Benchmark TFLite model inference."""
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Warmup
    for _ in range(10):
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()
    # Benchmark
    latencies = []
    mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()
        latencies.append((time.perf_counter() - start) * 1000)
    mem_after = psutil.Process().memory_info().rss / (1024 * 1024)
    print(f"Latency (p50): {np.percentile(latencies, 50):.1f} ms")
    print(f"Latency (p95): {np.percentile(latencies, 95):.1f} ms")
    print(f"Latency (p99): {np.percentile(latencies, 99):.1f} ms")
    print(f"Memory delta: {mem_after - mem_before:.1f} MB")
    print(f"Throughput: {1000 / np.mean(latencies):.1f} inferences/sec")
Got: Benchmark produces latency percentiles, memory usage, throughput metrics. GPU delegate shows 2-5x speedup over CPU for vision models. Gemma 4 2B hits 10-30 tokens/sec on flagship phones.
If fail: Ensure benchmark binary matches device architecture (arm64-v8a), verify model pushed to device (adb shell ls /data/local/tmp/), check sufficient device storage, kill background apps to reduce memory pressure, verify thermal throttling not active (adb shell cat /sys/class/thermal/thermal_zone*/temp).
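Once latencies are collected, they still have to be compared against a budget. The sketch below uses a nearest-rank percentile in pure Python (NumPy's default interpolates, so values can differ slightly) and converts per-token latency to tokens per second; all function names are hypothetical and the budget values are whatever the project defines.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_budget(latencies_ms, budget_ms, pct=95):
    """True when the chosen percentile of the latencies is within the budget."""
    return percentile(latencies_ms, pct) <= budget_ms


def tokens_per_second(ms_per_token):
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


if __name__ == "__main__":
    runs = [10, 20, 30, 40, 50]
    print(percentile(runs, 95), meets_budget(runs, budget_ms=100))
    print(tokens_per_second(50))
```

Checking a high percentile rather than the mean keeps occasional slow runs, for example during thermal throttling, from hiding behind a good average.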
Step 6: Package for Production Deployment
Build final mobile application with embedded or downloadable model.
// Android: EdgeAIManager.kt
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class EdgeAIManager(private val context: Context) {
    private var llmInference: LlmInference? = null

    fun initialize(modelPath: String) {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath)
            .setMaxTokens(512)
            .setTemperature(0.7f)
            .setTopK(40)
            .setResultListener { result, done ->
                // Handle streaming tokens
                onTokenReceived(result, done)
            }
            .build()
        llmInference = LlmInference.createFromOptions(context, options)
    }

    // Placeholder: forward streamed tokens to the UI layer
    private fun onTokenReceived(partialResult: String, done: Boolean) {
    }

    fun generateResponse(prompt: String): String {
        return llmInference?.generateResponse(prompt)
            ?: throw IllegalStateException("Model not initialized")
    }

    fun release() {
        llmInference?.close()
        llmInference = null
    }
}
Model download and caching strategy:
// ModelDownloader.kt
import android.content.Context
import java.io.File

class ModelDownloader(private val context: Context) {
    private val modelDir = File(context.filesDir, "models")

    suspend fun ensureModel(modelName: String, url: String): File {
        val modelFile = File(modelDir, modelName)
        // Reuse the cached copy when it already exists
        if (modelFile.exists()) return modelFile
        modelDir.mkdirs()
        // Download with progress tracking
        // ... (see EXAMPLES.md for complete implementation)
        return modelFile
    }
}
Got: Android app builds with MediaPipe dependency. Model loads on first launch. Inference runs within latency budget. Model cached after first download. Graceful fallback when device unsupported.
If fail: Check minSdk >= 26 in build.gradle, verify MediaPipe dependency version, ensure model file not corrupted (check SHA256), verify sufficient device storage for model, check ProGuard rules preserve MediaPipe classes, test on multiple device tiers.
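The SHA256 check mentioned above can be run on the build machine before a model is published, and the same digest can then be compared on device after download. A minimal sketch using only the standard library; the function names are hypothetical and the expected digest would come from wherever the project records release checksums.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large models do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path, expected_sha256):
    """True when the file's digest matches the expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_sha256.lower()
```

A mismatch usually means a truncated or interrupted download, so deleting the cached file and downloading again is a reasonable first response.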
Checks
- Model converts to TFLite/ONNX without op compatibility errors
- Quantized model accuracy within acceptable tolerance (< 2% degradation)
- Hardware delegate loads and accelerates inference
- Benchmark latency hits target (e.g., < 100ms for vision, < 50ms/token for LLM)
- Memory usage stays within device budget
- AI Edge Gallery loads and runs Gemma 4 model
- On-device LLM generates coherent responses
- Application handles model download, caching, updates
- Graceful degradation on unsupported devices
- Battery impact within acceptable range for target use case
Pitfalls
- Unsupported ops in TFLite: Custom ops fail conversion - use converter.allow_custom_ops = True or replace with supported alternatives, check op compatibility list
- Quantization accuracy loss: INT4 degrades quality for sensitive tasks - use mixed precision, calibrate with representative data, evaluate on edge-specific test set
- Delegate initialization failure: GPU delegate crashes on older devices - always implement CPU fallback, check delegate compatibility before loading
- Memory pressure on device: Model + app exceeds available RAM - use memory-mapped models, implement model unloading, reduce batch size to 1
- Thermal throttling: Sustained inference causes device overheating - implement duty cycling, reduce inference frequency, monitor thermal zones
- Model download size: Large models over cellular data - offer Wi-Fi-only download, implement resumable downloads, use progressive model loading
- Version fragmentation: Model works on some devices but not others - test on representative device matrix, use NNAPI version checks, maintain device compatibility database
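The duty cycling suggested for thermal throttling can be sketched as a pause that grows with temperature. Everything here is an assumption for illustration: the readings are taken to be millidegrees Celsius (common for /sys/class/thermal zones, but not guaranteed on every device), the 40 and 45 degree limits are placeholder values, and the function names are hypothetical.

```python
def hottest_zone_celsius(raw_readings):
    """Parse one integer reading per line (assumed millidegrees C) and return the maximum in C."""
    temps = []
    for line in raw_readings.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            temps.append(int(line) / 1000.0)
        except ValueError:
            continue  # skip zones that report non-numeric values
    return max(temps) if temps else None


def pause_between_inferences(temp_c, soft_limit_c=40.0, hard_limit_c=45.0, max_pause_s=1.0):
    """Return seconds to wait before the next inference, scaling linearly between the limits."""
    if temp_c <= soft_limit_c:
        return 0.0
    if temp_c >= hard_limit_c:
        return max_pause_s
    return max_pause_s * (temp_c - soft_limit_c) / (hard_limit_c - soft_limit_c)


if __name__ == "__main__":
    temp = hottest_zone_celsius("35000\n42500\n")
    print(temp, pause_between_inferences(temp))
```

The limits should be tuned per device tier, since comfortable operating temperatures differ between phones and embedded boards.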
See Also
- deploy-ml-model-serving - Cloud-based model serving (complement to edge)
- monitor-model-drift - Monitor model quality over time
- register-ml-model - Register models before edge deployment
- create-dockerfile - Containerize edge model conversion pipeline
- create-multistage-dockerfile - Multi-stage builds for model conversion pipelines
