MCP HubMCP Hub
Retour aux compétences

deploy-edge-ai-model

pjt222
Mis à jour Yesterday
2 vues
17
2
17
Voir sur GitHub
Développementaiapi

À propos

Cette compétence déploie des modèles de ML sur des appareils périphériques en utilisant des frameworks tels que TensorFlow Lite et ONNX Runtime, en les optimisant via la quantification et des délégués matériels. Elle permet l'inférence sur l'appareil pour des scénarios où la connectivité cloud est limitée par la latence, le coût ou la fiabilité. Utilisez-la pour déployer sur des systèmes mobiles, IoT ou embarqués, avec des outils pour l'évaluation des performances et le déploiement spécifique à la plateforme.

Installation rapide

Claude Code

Recommandé
Principal
npx skills add pjt222/agent-almanac -a claude-code
Commande PluginAlternatif
/plugin add https://github.com/pjt222/agent-almanac
Git CloneAlternatif
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/deploy-edge-ai-model

Copiez et collez cette commande dans Claude Code pour installer cette compétence

Documentation

Deploy Edge AI Model

See Extended Examples for complete configuration files, quantization scripts, and benchmark templates.

ML → edge devices. Optimized inference, HW accel, on-device mgmt.

Use When

  • LLMs (Gemma 4, Phi, Llama) → mobile via Google AI Edge Gallery
  • Convert → TFLite/ONNX for on-device
  • Quantize → INT8/INT4, less mem + faster
  • Android/iOS apps w/ local AI
  • HW delegate select (GPU, NPU, DSP, Hexagon, CoreML)
  • Bench latency + mem on target
  • MediaPipe tasks → mobile/embedded

In

  • Required: Trained model (SavedModel, PyTorch, ONNX, HF checkpoint)
  • Required: Target platform (Android, iOS, Linux embedded, browser)
  • Required: Device constraints (RAM, storage, compute)
  • Optional: Calibration dataset → post-training quant
  • Optional: AI Edge Gallery config → LLM deploy
  • Optional: HW delegate prefs

Do

Step 1: Eval model → edge

Size, latency, device cap.

# assess_model.py
import os
import tensorflow as tf

def assess_model_for_edge(saved_model_path, target_ram_mb=4096):
    """Evaluate whether a model is suitable for edge deployment."""
    model = tf.saved_model.load(saved_model_path)

    # Check model size on disk
    model_size_mb = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)

    print(f"Model size: {model_size_mb:.1f} MB")
    print(f"Target RAM: {target_ram_mb} MB")
    print(f"Size/RAM ratio: {model_size_mb / target_ram_mb:.2%}")

    if model_size_mb > target_ram_mb * 0.25:
        print("WARNING: Model exceeds 25% of device RAM - quantization recommended")
        return False
    return True

Decision matrix:

Model SizeDevice RAMRecommended Action
< 50 MB2+ GBDirect TFLite conversion
50-500 MB4+ GBINT8 quantization + TFLite
500 MB-2 GB6+ GBINT4 quantization + AI Edge Gallery
2-4 GB8+ GBGemma 4 via AI Edge Gallery with INT4
> 4 GB12+ GBWeight streaming or cloud-edge hybrid

→ Assessment done, size/RAM ratios, quant recommendation by constraints.

If err: SavedModel path valid (ls saved_model/), TF installed (python -c "import tensorflow"), disk space OK, format supported.

Step 2: LLMs via Google AI Edge Gallery

Gemma 4 + LLMs → Android.

# Clone AI Edge Gallery
git clone https://github.com/nickoala/ai-edge-gallery.git
cd ai-edge-gallery

# Build the Android app
./gradlew assembleDebug

# Install on connected device
adb install -r app/build/outputs/apk/debug/app-debug.apk

Gemma 4 config:

{
  "models": [
    {
      "name": "Gemma 4 2B IT",
      "url": "https://huggingface.co/google/gemma-4-2b-it-gpu-int4",
      "format": "tflite",
      "backend": "gpu",
      "config": {
        "max_tokens": 1024,
        "temperature": 0.7,
        "top_k": 40,
        "top_p": 0.95
      }
    },
    {
      "name": "Gemma 4 4B IT",
      "url": "https://huggingface.co/google/gemma-4-4b-it-gpu-int4",
      "format": "tflite",
      "backend": "gpu",
      "config": {
        "max_tokens": 2048,
        "temperature": 0.7
      }
    }
  ]
}

Programmatic inference w/ LLM Inference API:

# gemma_edge_inference.py
from mediapipe.tasks.genai import llm_inference

# Configure the LLM
options = llm_inference.LlmInferenceOptions(
    model_path="/data/local/tmp/gemma-4-2b-it-int4.tflite",
    max_tokens=512,
    temperature=0.7,
    top_k=40,
    supported_lora_ranks=[4, 8, 16]  # Optional LoRA support
)

# Create inference engine
engine = llm_inference.LlmInference(options=options)

# Run inference
response = engine.generate_response("Explain edge computing in one sentence.")
print(response)

# Streaming inference
for chunk in engine.generate_response_async("List three benefits of on-device AI."):
    print(chunk, end="", flush=True)

→ App builds+installs, Gemma 4 downloads, coherent responses, GPU delegate active.

If err: SDK ≥ 26 (adb shell getprop ro.build.version.sdk), device storage OK, GPU delegate supported (adb logcat | grep -i delegate), HF access, ADB connection (adb devices).

Step 3: Convert + quantize w/ TFLite

Standard → TFLite w/ post-training quant.

# convert_tflite.py
import os
import tensorflow as tf
import numpy as np

def convert_to_tflite(saved_model_path, output_path, quantization="dynamic"):
    """Convert SavedModel to TFLite with quantization."""
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)

    if quantization == "dynamic":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]

    elif quantization == "int8":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_ops = [
            tf.lite.OpsSet.TFLITE_BUILTINS_INT8
        ]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8

        # Representative dataset for calibration
        def representative_dataset():
            for _ in range(100):
                yield [np.random.randn(1, 224, 224, 3).astype(np.float32)]
        converter.representative_dataset = representative_dataset

    elif quantization == "float16":
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.target_spec.supported_types = [tf.float16]

    tflite_model = converter.convert()

    with open(output_path, "wb") as f:
        f.write(tflite_model)

    original_size = sum(
        os.path.getsize(os.path.join(dp, f))
        for dp, _, filenames in os.walk(saved_model_path)
        for f in filenames
    ) / (1024 * 1024)
    quantized_size = len(tflite_model) / (1024 * 1024)
    print(f"Original: {original_size:.1f} MB -> Quantized: {quantized_size:.1f} MB")
    print(f"Compression ratio: {original_size / quantized_size:.1f}x")

# Usage
convert_to_tflite("saved_model/", "model_int8.tflite", quantization="int8")

ONNX Runtime quant alt:

# quantize_onnx.py
from onnxruntime.quantization import quantize_dynamic, quantize_static, QuantType

# Dynamic quantization (no calibration data needed)
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8
)

# Static quantization (better accuracy, needs calibration)
# ... (see EXAMPLES.md for complete calibration workflow)

→ TFLite gen'd, size -2-4x w/ INT8, accuracy within 1-2%, ONNX quant valid.

If err: TF ≥ 2.15, rep dataset matches input shape, all ops supported (converter.allow_custom_ops = True fallback), ONNX opset compat.

Step 4: HW delegates

Select + config.

# configure_delegates.py
import tensorflow as tf

def create_interpreter_with_delegate(model_path, delegate="gpu"):
    """Create TFLite interpreter with hardware delegate."""

    if delegate == "gpu":
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_gpu_delegate.so",
            options={"precision": "fp16", "allow_quantized_models": "true"}
        )
    elif delegate == "nnapi":
        # Android Neural Networks API - routes to NPU/DSP
        delegate_obj = tf.lite.experimental.load_delegate(
            "libtensorflowlite_nnapi_delegate.so"
        )
    elif delegate == "xnnpack":
        # Optimized CPU inference
        delegate_obj = None  # XNNPACK is default in TFLite

    interpreter = tf.lite.Interpreter(
        model_path=model_path,
        experimental_delegates=[delegate_obj] if delegate_obj else None,
        num_threads=4
    )
    interpreter.allocate_tensors()
    return interpreter

Delegate guide:

DeviceBest DelegateFallbackNotes
Android (Qualcomm)NNAPI -> Hexagon DSPGPU -> XNNPACKCheck nnapi_accelerator_name
Android (MediaTek)NNAPI -> APUGPU -> XNNPACKDimensity chips have dedicated APU
Android (Samsung)NNAPI -> NPUGPU -> XNNPACKExynos NPU via NNAPI
iOSCoreML delegateMetal GPUUse coreml_delegate for ANE
Linux embeddedGPU (if available)XNNPACKRPi uses XNNPACK CPU
BrowserWebGL / WebGPUWASM SIMDVia TensorFlow.js

→ Delegate loads, inference on accel, latency 2-10x vs CPU-only.

If err: Lib on device, delegate supported (adb shell cat /proc/cpuinfo), fall back XNNPACK, OpenCL for GPU, NNAPI ver.

Step 5: Bench on-device

Latency, mem, power.

# Use TFLite benchmark tool
adb push model_int8.tflite /data/local/tmp/

# CPU benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --num_threads=4 \
  --num_runs=50 \
  --warmup_runs=5

# GPU benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --use_gpu=true \
  --num_runs=50

# NNAPI benchmark
adb shell /data/local/tmp/benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --use_nnapi=true \
  --nnapi_accelerator_name=google-edgetpu \
  --num_runs=50

Python bench:

# benchmark_edge.py
import time
import numpy as np
import psutil

def benchmark_inference(interpreter, input_data, num_runs=100):
    """Benchmark TFLite model inference."""
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Warmup
    for _ in range(10):
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()

    # Benchmark
    latencies = []
    mem_before = psutil.Process().memory_info().rss / (1024 * 1024)
    for _ in range(num_runs):
        start = time.perf_counter()
        interpreter.set_tensor(input_details[0]["index"], input_data)
        interpreter.invoke()
        latencies.append((time.perf_counter() - start) * 1000)
    mem_after = psutil.Process().memory_info().rss / (1024 * 1024)

    print(f"Latency (p50): {np.percentile(latencies, 50):.1f} ms")
    print(f"Latency (p95): {np.percentile(latencies, 95):.1f} ms")
    print(f"Latency (p99): {np.percentile(latencies, 99):.1f} ms")
    print(f"Memory delta: {mem_after - mem_before:.1f} MB")
    print(f"Throughput: {1000 / np.mean(latencies):.1f} inferences/sec")

→ Latency percentiles + mem + throughput. GPU 2-5x vs CPU. Gemma 4 2B → 10-30 tok/sec flagship.

If err: Bench binary matches arch (arm64-v8a), model pushed (adb shell ls /data/local/tmp/), storage OK, kill bg apps, thermal throttle check (adb shell cat /sys/class/thermal/thermal_zone*/temp).

Step 6: Package → prod

Mobile app w/ embedded/downloadable model.

// Android: EdgeAIManager.kt
import com.google.mediapipe.tasks.genai.llminference.LlmInference

class EdgeAIManager(private val context: Context) {
    private var llmInference: LlmInference? = null

    fun initialize(modelPath: String) {
        val options = LlmInference.LlmInferenceOptions.builder()
            .setModelPath(modelPath)
            .setMaxTokens(512)
            .setTemperature(0.7f)
            .setTopK(40)
            .setResultListener { result, done ->
                // Handle streaming tokens
                onTokenReceived(result, done)
            }
            .build()

        llmInference = LlmInference.createFromOptions(context, options)
    }

    fun generateResponse(prompt: String): String {
        return llmInference?.generateResponse(prompt)
            ?: throw IllegalStateException("Model not initialized")
    }

    fun release() {
        llmInference?.close()
        llmInference = null
    }
}

Download + cache:

// ModelDownloader.kt
class ModelDownloader(private val context: Context) {
    private val modelDir = File(context.filesDir, "models")

    suspend fun ensureModel(modelName: String, url: String): File {
        val modelFile = File(modelDir, modelName)
        if (modelFile.exists()) return modelFile

        modelDir.mkdirs()
        // Download with progress tracking
        // ... (see EXAMPLES.md for complete implementation)
        return modelFile
    }
}

→ App builds w/ MediaPipe, model loads first launch, latency OK, cached after download, fallback on unsupported.

If err: minSdk ≥ 26, MediaPipe dep ver, model SHA256, storage, ProGuard preserves MediaPipe classes, test multi-device.

Check

  • Model → TFLite/ONNX w/o op errs
  • Quant accuracy < 2% degrade
  • HW delegate loads + accels
  • Latency meets target (< 100ms vision, < 50ms/tok LLM)
  • Mem within budget
  • AI Edge Gallery runs Gemma 4
  • On-device LLM coherent
  • App handles download/cache/update
  • Graceful degrade on unsupported
  • Battery acceptable

Traps

  • Unsupported TFLite ops: Custom ops fail → converter.allow_custom_ops = True or replace, check compat list
  • Quant accuracy loss: INT4 degrades sensitive → mixed precision, calibrate w/ rep data
  • Delegate init fail: GPU crashes old devices → CPU fallback, check compat
  • Mem pressure: Model + app > RAM → memory-mapped, unload, batch=1
  • Thermal throttle: Sustained inference → overheat → duty cycle, reduce freq, monitor zones
  • Download size: Large over cellular → Wi-Fi-only, resumable, progressive
  • Version fragmentation: Works some not others → device matrix test, NNAPI ver checks, compat DB

  • deploy-ml-model-serving — cloud serving (complement to edge)
  • monitor-model-drift — quality over time
  • register-ml-model — register before edge deploy
  • create-dockerfile — containerize conversion pipeline
  • create-multistage-dockerfile — multi-stage builds

Dépôt GitHub

pjt222/agent-almanac
Chemin: i18n/caveman-ultra/skills/deploy-edge-ai-model
0
agentsagentskillsai-assisted-developmentclaude-codeskillsteams

Compétences associées

qmd

Développement

qmd est un outil CLI de recherche et d'indexation locale qui permet aux développeurs d'indexer et de rechercher dans des fichiers locaux en utilisant une recherche hybride combinant BM25, des embeddings vectoriels et du reranking. Il prend en charge à la fois une utilisation en ligne de commande et un mode MCP (Model Context Protocol) pour l'intégration avec Claude. L'outil utilise Ollama pour les embeddings et stocke les index localement, ce qui le rend idéal pour rechercher dans de la documentation ou des bases de code directement depuis le terminal.

Voir la compétence

subagent-driven-development

Développement

Cette compétence exécute des plans de mise en œuvre en déployant un nouveau sous-agent pour chaque tâche indépendante, avec une revue de code entre les tâches. Elle permet une itération rapide tout en maintenant des contrôles de qualité grâce à ce processus de revue. Utilisez-la lorsque vous travaillez sur des tâches principalement indépendantes au sein d'une même session pour assurer une progression continue avec des vérifications de qualité intégrées.

Voir la compétence

mcporter

Développement

La compétence mcporter permet aux développeurs de gérer et d'appeler des serveurs Model Context Protocol (MCP) directement depuis Claude. Elle fournit des commandes pour lister les serveurs disponibles, appeler leurs outils avec des arguments, et gérer l'authentification ainsi que le cycle de vie du démon. Utilisez cette compétence pour intégrer et tester les fonctionnalités des serveurs MCP dans votre flux de travail de développement.

Voir la compétence

adk-deployment-specialist

Développement

Cette compétence déploie et orchestre des agents Vertex AI ADK en utilisant le protocole A2A, gérant la découverte d'AgentCard, la soumission de tâches, et prenant en charge des outils tels que le bac à sable d'exécution de code et la banque de mémoire. Elle permet de construire des systèmes multi-agents avec des modèles d'orchestration séquentiels, parallèles ou en boucle en Python, Java ou Go. Utilisez-la lorsqu'on vous demande de déployer des agents ADK ou d'orchestrer des flux de travail d'agents sur Google Cloud.

Voir la compétence