remote-run-ssh

kungfuai

About

This skill enables developers to run CVlization examples on a remote GPU host (`ssh l1`) by copying only the necessary example directory and shared library to `/tmp`, then executing the example's Docker scripts. It is designed for heavy training, evaluation, and performance-testing tasks that require CUDA, without altering the remote's main checkout. The approach keeps the remote copy minimal while leveraging the examples' existing Docker workflows for reproducibility.

Quick Install

Claude Code

Plugin command (recommended):

/plugin add https://github.com/kungfuai/CVlization

Git clone (alternative):

git clone https://github.com/kungfuai/CVlization.git ~/.claude/skills/remote-run-ssh

Copy and paste one of these commands into Claude Code to install the skill.

Documentation

Remote Run over SSH

Operate CVlization examples on the remote GPU reachable as ssh l1. This playbook keeps the remote copy minimal—just the target example folder (e.g., examples/perception/multimodal_multitask/recipe_analysis_torch) and the cvlization/ library—then relies on the example’s own build.sh / train.sh Docker helpers. The user’s long-lived checkout on the remote stays untouched.

When to Use

  • Heavy training or evaluation runs that require CUDA (CIFAR10 speed runs, multimodal pipelines, etc.).
  • Performance or regression measurements on the remote GPU after local code changes.
  • Producing reproducible logs / artifacts for discussions or CI baselines without pushing a branch first.

Prerequisites

  • Local repo state ready to sync (uncommitted changes acceptable).
  • SSH config already maps the GPU machine to l1 (a sample entry follows this list).
  • Remote host provides NVIDIA GPU (currently A10) and Docker with GPU runtime enabled.
  • At least ~15 GB free under /tmp for the slim workspace, Docker context, and caches.
  • Hugging Face tokens or other creds available locally if the example pulls hub assets.
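
For reference, a minimal ~/.ssh/config entry for the l1 alias looks like the sketch below; the hostname, user, and key path are placeholders to substitute:

# ~/.ssh/config (hypothetical values: use your own host, user, and key)
Host l1
    HostName gpu-host.example.com
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519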

Quick Reference

  1. Choose the example to run.
  2. rsync only the example folder, cvlization/, and any required helper dirs to /tmp/cvlization_remote on l1.
  3. On l1, run ./build.sh inside the example folder to build the Docker image.
  4. Run ./train.sh (or the example’s equivalent) to launch the job with GPU access.
  5. Collect logs / metrics and record the run in var/skills/remote-run-ssh/runs/<timestamp>/log.md.

Detailed Procedure

1. Identify the example and supporting files

  • Note the example path relative to repo root (e.g., examples/perception/multimodal_multitask/recipe_analysis_torch).
  • List any extra assets the run needs (custom configs under examples, top-level scripts, environment files, etc.).
  • Confirm cvlization/ includes all library modules the example imports (the grep sketch after this list enumerates them).
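
To enumerate those imports, a quick grep over the example folder works, assuming standard import cvlization... / from cvlization... syntax:

# List the distinct cvlization imports the example relies on
grep -rhoE '(from|import) cvlization[.A-Za-z0-9_]*' \
  examples/perception/multimodal_multitask/recipe_analysis_torch | sort -u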

2. Sync minimal workspace to /tmp

REMOTE_ROOT=/tmp/cvlization_remote
rsync -az --delete \
  --include='cvlization/***' \
  --include='examples/***' \
  --include='scripts/***' \
  --include='pyproject.toml' \
  --include='setup.cfg' \
  --include='README.md' \
  --exclude='*' \
  ./ l1:${REMOTE_ROOT}/

Tips:

  • Adjust the include list if the example needs additional files (e.g., docker-compose.yml, requirements.txt). The blanket --exclude='*' prevents unrelated directories from syncing.
  • Keep the remote path structure (${REMOTE_ROOT}/examples/...) aligned with the local repo so relative paths like ../../../.. used inside scripts still resolve to the repo root.
  • Avoid syncing .git, local datasets, .venv, or heavy cache directories.
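
When unsure what the filter rules will pick up, a dry run lists the transfer set without copying anything; this is the same command with -v (verbose) and -n (dry run) added:

REMOTE_ROOT=/tmp/cvlization_remote
rsync -azvn --delete \
  --include='cvlization/***' \
  --include='examples/***' \
  --include='scripts/***' \
  --include='pyproject.toml' \
  --include='setup.cfg' \
  --include='README.md' \
  --exclude='*' \
  ./ l1:${REMOTE_ROOT}/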

3. Build the example image

ssh l1 'cd /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch && ./build.sh'
  • The Docker build context is limited to the example folder, keeping builds quick.
  • Edit build.sh locally if you need custom base images or dependency tweaks before re-syncing.
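
For orientation, build.sh in these examples is typically a thin wrapper around docker build; the sketch below is hypothetical (the image tag recipe_analysis_torch is assumed), so check the real script before editing:

#!/usr/bin/env bash
set -euo pipefail
# Build from the example folder so the Docker context stays small.
# The image tag is an assumption; the real script may name it differently.
docker build -t recipe_analysis_torch .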

4. Confirm GPU availability

ssh l1 'nvidia-smi'

Ensure no conflicting jobs are consuming the GPU before starting a long run.
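
A small pre-flight guard can enforce that check automatically; the 10% utilization threshold below is an arbitrary choice:

# Abort if the GPU already shows meaningful utilization (threshold is arbitrary)
UTIL=$(ssh l1 "nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits" | head -1)
if [ "${UTIL}" -gt 10 ]; then
  echo "GPU busy at ${UTIL}% utilization; not starting the run." >&2
  exit 1
fi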

5. Run the training script (Docker)

ssh l1 'cd /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch && ./train.sh > run.log 2>&1'
  • Tail logs while the job runs:
ssh l1 'tail -f /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log'
  • train.sh already mounts the example directory at /workspace, mounts the synced repo root read-only at /cvlization_repo, and sets PYTHONPATH=/cvlization_repo (see the sketch after this list).
  • Customize environment variables or extra mounts by editing train.sh locally (e.g., injecting dataset paths, WANDB keys) then re-syncing.
  • If an example lacks Docker scripts, fall back to running its entrypoint directly (python train.py) inside the container or a temporary venv—but document the deviation in the run log.
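
The docker run inside train.sh is expected to look roughly like the sketch below; this illustrates the mounts and flags described above with an assumed image tag, and is not the script verbatim:

#!/usr/bin/env bash
set -euo pipefail
# Mount the example dir at /workspace and the synced repo root read-only,
# then point PYTHONPATH at the repo so `import cvlization` resolves.
docker run --rm --gpus=all \
  -v "$(pwd)":/workspace \
  -v "$(pwd)/../../../..":/cvlization_repo:ro \
  -e PYTHONPATH=/cvlization_repo \
  -w /workspace \
  recipe_analysis_torch python train.py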

6. Capture metrics

Log files should include per-epoch summaries and final metrics. Record:

  • Wall-clock time / throughput.
  • Accuracy, loss, or other task metrics.
  • Warnings or notable log lines (e.g., retry downloads, CUDA warnings).
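
One way to pull those lines out of run.log afterwards is a grep like the one below; the pattern is a guess at the log format, so adapt it to what the example actually prints:

# Surface per-epoch and final-metric lines (pattern is a guess; adjust as needed)
ssh l1 "grep -iE 'epoch|accuracy|loss' /tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log" | tail -20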

7. Retrieve artifacts (optional)

rsync -az l1:/tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/run.log \
  ./remote_runs/$(date +%Y%m%dT%H%M%S)_multimodal.log

Copy checkpoints, TensorBoard logs, or generated samples in the same manner if needed.
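
For example, a checkpoint directory could come back the same way; checkpoints/ is an assumed output location, since examples differ in where they write artifacts:

rsync -az \
  l1:/tmp/cvlization_remote/examples/perception/multimodal_multitask/recipe_analysis_torch/checkpoints/ \
  ./remote_runs/$(date +%Y%m%dT%H%M%S)_checkpoints/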

8. Document the run

  • Create var/skills/remote-run-ssh/runs/<timestamp>/log.md summarizing:
    • Example path, git commit or diff basis.
    • Commands executed (build.sh, train.sh args).
    • Key metrics / observations.
    • Location of logs or artifacts (local paths or remote references).
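
A minimal skeleton for that log.md, offered as a suggested structure rather than a required template:

# Remote run: recipe_analysis_torch
- Example: examples/perception/multimodal_multitask/recipe_analysis_torch
- Basis: commit <sha> plus local uncommitted diff (describe or attach)
- Commands: ./build.sh && ./train.sh > run.log 2>&1
- Metrics: wall-clock time, throughput, accuracy/loss
- Artifacts: remote_runs/<timestamp>_multimodal.log (local); run.log (remote)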

9. Cleanup (optional)

  • Delete /tmp/cvlization_remote when finished if the disk budget is tight (ssh l1 'rm -rf /tmp/cvlization_remote').
  • Clear cached datasets (rm -rf /root/.cache/... inside Docker) only if future runs should start fresh.

Troubleshooting

  • Missing module inside container: Ensure train.sh mounts the repo root and sets PYTHONPATH. Re-run rsync if files were added after the initial sync.
  • Docker build fails: Inspect the output of ./build.sh. Some examples assume base images with CUDA toolkits—update the Dockerfile accordingly.
  • CUDA OOM: Reduce batch sizes or precision, or run one job at a time on l1.
  • No GPU detected: Confirm --gpus=all is present in train.sh and check nvidia-smi on the host.
  • Long sync times: Tighten the rsync include rules to the specific example and library folders required.
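
For the "No GPU detected" case, the check below isolates whether Docker's GPU runtime works at all, independent of the example image, by using a public CUDA base image:

ssh l1 "docker run --rm --gpus=all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi"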

Outputs

Every run should leave:

  • Remote workspace at /tmp/cvlization_remote containing the synced example + library.
  • Local or remote logs with the captured metrics.
  • A run log under var/skills/remote-run-ssh/runs/<timestamp>/log.md documenting the session.

GitHub Repository

kungfuai/CVlization
Path: .claude/skills/remote-run-ssh
Tags: ai, decentralized, docker, finetune, generative-ai, inference

Related Skills

subagent-driven-development

Development

This skill executes implementation plans by dispatching a fresh subagent for each independent task, with code review between tasks. It enables fast iteration while maintaining quality gates through this review process. Use it when working on mostly independent tasks within the same session to ensure continuous progress with built-in quality checks.


algorithmic-art

Meta

This Claude Skill creates original algorithmic art using p5.js with seeded randomness and interactive parameters. It generates .md files for algorithmic philosophies, plus .html and .js files for interactive generative art implementations. Use it when developers need to create flow fields, particle systems, or other computational art while avoiding copyright issues.


executing-plans

Design

Use the executing-plans skill when you have a complete implementation plan to execute in controlled batches with review checkpoints. It loads and critically reviews the plan, then executes tasks in small batches (default 3 tasks) while reporting progress between each batch for architect review. This ensures systematic implementation with built-in quality control checkpoints.


cost-optimization

Other

This Claude Skill helps developers optimize cloud costs through resource rightsizing, tagging strategies, and spending analysis. It provides a framework for reducing cloud expenses and implementing cost governance across AWS, Azure, and GCP. Use it when you need to analyze infrastructure costs, right-size resources, or meet budget constraints.
