serialize-data-formats
About
This skill helps developers serialize and deserialize data across multiple formats like JSON, XML, YAML, Protobuf, MessagePack, and Arrow/Parquet. It provides guidance on selecting the right format, implementing encode/decode patterns, and understanding performance tradeoffs and interoperability. Use it to choose wire formats for APIs, persist data, exchange between languages, or optimize for size and speed.
Quick Install
Claude Code
Recommended:
npx skills add pjt222/agent-almanac -a claude-code
/plugin add https://github.com/pjt222/agent-almanac
git clone https://github.com/pjt222/agent-almanac.git ~/.claude/skills/serialize-data-formats
Documentation
Serialize Data Formats
Select+impl right serialization format → correct encode/decode + perf awareness.
Use When
- Wire format for API
- Persist structured data → disk|object storage
- Exchange between langs
- Optimize size|speed
- Migrate formats
In
- Required: Data structure (schema|example)
- Required: Use case (API|storage|stream|analytics)
- Optional: Perf reqs (size|speed|schema enforce)
- Optional: Target lang|runtime constraints
- Optional: Human readability
Do
Step 1: Select Format
| Format | Human Readable | Schema | Size | Speed | Best For |
|---|---|---|---|---|---|
| JSON | Yes | Optional (JSON Schema) | Medium | Medium | REST APIs, config, broad interop |
| XML | Yes | XSD, DTD | Large | Slow | Enterprise/legacy, SOAP, documents |
| YAML | Yes | Optional | Medium | Slow | Config files, CI/CD, Kubernetes |
| Protocol Buffers | No | Required (.proto) | Small | Fast | gRPC, microservices, mobile |
| MessagePack | No | None | Small | Fast | Real-time, embedded, Redis |
| Arrow/Parquet | No | Built-in | Very Small | Very Fast | Analytics, columnar queries, data lakes |
Decision tree:
- Human edit? → YAML (config) | JSON (data)
- Strict schema + fast RPC? → Protobuf
- Smallest wire? → MessagePack | Protobuf
- Columnar analytics? → Parquet
- In-memory interchange? → Arrow
- Legacy enterprise? → XML
→ Format selected w/ documented rationale.
If err: reqs conflict (human + fast) → prioritize primary use case + note tradeoff.
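
The decision tree above can be sketched as a small lookup function. This is a minimal sketch only: `select_format` and its boolean flags are hypothetical names introduced for illustration, and the check order simply mirrors the bullets (first match wins), which is one way to apply the "prioritize primary use case" rule.

```python
def select_format(
    human_edit: bool = False,
    config: bool = False,
    strict_schema_rpc: bool = False,
    smallest_wire: bool = False,
    columnar_analytics: bool = False,
    in_memory_interchange: bool = False,
    legacy_enterprise: bool = False,
) -> str:
    """Return a format name by walking the decision tree; first match wins."""
    if human_edit:
        return "YAML" if config else "JSON"
    if strict_schema_rpc:
        return "Protobuf"
    if smallest_wire:
        return "MessagePack"  # Protobuf is the alternative when a schema exists
    if columnar_analytics:
        return "Parquet"
    if in_memory_interchange:
        return "Arrow"
    if legacy_enterprise:
        return "XML"
    return "JSON"  # assumption: broad-interop default when nothing else applies


print(select_format(human_edit=True, config=True))
print(select_format(strict_schema_rpc=True))
print(select_format(columnar_analytics=True))
```

Conflicting flags (for example human-editable and fast) resolve to the earlier branch, so the tradeoff still needs to be noted in the rationale.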
Step 2: JSON Serialize
import base64
import json
from dataclasses import dataclass, asdict
from datetime import datetime, date

@dataclass
class Measurement:
    sensor_id: str
    value: float
    unit: str
    timestamp: datetime

# Custom encoder for types the json module cannot serialize natively
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        # datetime is a subclass of date, so one check covers both
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, bytes):
            return base64.b64encode(obj).decode("ascii")
        return super().default(obj)

# Serialize
measurement = Measurement("sensor-01", 23.5, "celsius", datetime.now())
json_str = json.dumps(asdict(measurement), cls=CustomEncoder, indent=2)

# Deserialize (timestamp comes back as an ISO 8601 string, not a datetime)
data = json.loads(json_str)
# R: JSON with jsonlite
library(jsonlite)
# Serialize
df <- data.frame(sensor_id = "sensor-01", value = 23.5, unit = "celsius")
json_str <- jsonlite::toJSON(df, auto_unbox = TRUE, pretty = TRUE)
# Deserialize
df_back <- jsonlite::fromJSON(json_str)
→ Round-trip preserves all types accurately.
If err: type lost (dates → strings) → add explicit conversion in deserialize.
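
The explicit conversion mentioned above might look like the following. It is a minimal sketch using only the standard library; `encode_extra` and `restore_types` are hypothetical helper names, and it assumes the only datetime field is called `timestamp`.

```python
import json
from datetime import datetime


def encode_extra(obj):
    """Fallback for json.dumps: turn datetimes into ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def restore_types(record):
    """object_hook for json.loads: convert known date fields back."""
    # Assumption: only the "timestamp" key carries a datetime.
    if "timestamp" in record:
        record["timestamp"] = datetime.fromisoformat(record["timestamp"])
    return record


original = {
    "sensor_id": "sensor-01",
    "value": 23.5,
    "unit": "celsius",
    "timestamp": datetime(2025, 1, 1, 12, 30, 0),
}
json_str = json.dumps(original, default=encode_extra)
restored = json.loads(json_str, object_hook=restore_types)
print(restored == original)
```

Keying the conversion on field names keeps the wire format plain JSON; a schema-driven approach would be needed once more than a few fields carry non-JSON types.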
Step 3: Protobuf
.proto:
syntax = "proto3";
package sensors;
message Measurement {
  string sensor_id = 1;
  double value = 2;
  string unit = 3;
  int64 timestamp_ms = 4;  // Unix milliseconds
}
message MeasurementBatch {
  repeated Measurement measurements = 1;
}
Gen+use:
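# Generate Python code (writes sensors_pb2.py)
protoc --python_out=. sensors.proto
# Generate Go code (needs the protoc-gen-go plugin on PATH and a go_package option in sensors.proto)
protoc --go_out=. sensors.proto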
from sensors_pb2 import Measurement, MeasurementBatch
import time
# Serialize
m = Measurement(
    sensor_id="sensor-01",
    value=23.5,
    unit="celsius",
    timestamp_ms=int(time.time() * 1000),
)
binary = m.SerializeToString() # Compact binary
# Deserialize
m2 = Measurement()
m2.ParseFromString(binary)
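→ Binary noticeably smaller than JSON (often cited as 3-10x; string-heavy messages gain less).
If err: protoc unavail → `python -m grpc_tools.protoc` (grpcio-tools bundles the compiler) | lang-native lib (betterproto Py).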
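
To see where the size difference comes from, the Measurement message can be encoded by hand following the documented Protobuf wire format. This is an illustration only, not a replacement for generated code: `varint`, `tag`, `encode_string`, and `encode_measurement` are hypothetical helpers, and the timestamp value is made up.

```python
import json
import struct


def varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def tag(field_number: int, wire_type: int) -> bytes:
    """Field key: field number shifted left 3 bits, OR-ed with the wire type."""
    return varint((field_number << 3) | wire_type)


def encode_string(field_number: int, text: str) -> bytes:
    raw = text.encode("utf-8")
    return tag(field_number, 2) + varint(len(raw)) + raw  # wire type 2: length-delimited


def encode_measurement(sensor_id: str, value: float, unit: str, timestamp_ms: int) -> bytes:
    return (
        encode_string(1, sensor_id)
        + tag(2, 1) + struct.pack("<d", value)  # wire type 1: 64-bit little-endian
        + encode_string(3, unit)
        + tag(4, 0) + varint(timestamp_ms)      # wire type 0: varint
    )


record = {
    "sensor_id": "sensor-01",
    "value": 23.5,
    "unit": "celsius",
    "timestamp_ms": 1700000000000,
}
binary = encode_measurement(**record)
json_bytes = json.dumps(record, separators=(",", ":")).encode()
print(len(binary), len(json_bytes))
```

Printing both lengths shows how much of the JSON size is field names and punctuation, which the binary form replaces with one-byte tags here.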
Step 4: MessagePack
import msgpack
from datetime import datetime
# Custom packing for datetime
def encode_datetime(obj):
    if isinstance(obj, datetime):
        return {"__datetime__": True, "s": obj.isoformat()}
    return obj
def decode_datetime(obj):
    if "__datetime__" in obj:
        return datetime.fromisoformat(obj["s"])
    return obj
data = {"sensor_id": "sensor-01", "value": 23.5, "ts": datetime.now()}
# Serialize (smaller than JSON, faster than JSON)
packed = msgpack.packb(data, default=encode_datetime)
# Deserialize
unpacked = msgpack.unpackb(packed, object_hook=decode_datetime, raw=False)
→ Output 15-30% smaller than JSON for typical payloads.
If err: lang lacks MessagePack → fallback JSON+gzip.
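
The JSON+gzip fallback can be sketched with the standard library alone. `pack_json_gzip` and `unpack_json_gzip` are hypothetical helper names, and the sample readings are invented for the demonstration.

```python
import gzip
import json


def pack_json_gzip(obj) -> bytes:
    """Serialize to compact JSON, then gzip the UTF-8 bytes."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)


def unpack_json_gzip(blob: bytes):
    """Reverse of pack_json_gzip."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


readings = [
    {"sensor_id": "sensor-01", "value": 20.0 + i * 0.1, "unit": "celsius"}
    for i in range(500)
]
raw_size = len(json.dumps(readings, separators=(",", ":")).encode("utf-8"))
blob = pack_json_gzip(readings)
print(raw_size, "->", len(blob), "bytes")
assert unpack_json_gzip(blob) == readings
```

gzip adds a fixed header and trailer, so this pays off on repetitive batches like the one above and can make very small payloads larger.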
Step 5: Parquet (Columnar)
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# Create data
df = pd.DataFrame({
    "sensor_id": ["s-01", "s-02", "s-01", "s-03"] * 1000,
    "value": [23.5, 18.2, 24.1, 19.8] * 1000,
    "unit": ["celsius"] * 4000,
    "timestamp": pd.date_range("2025-01-01", periods=4000, freq="min"),
})
# Write Parquet (columnar, compressed)
table = pa.Table.from_pandas(df)
pq.write_table(table, "measurements.parquet", compression="snappy")
# Read Parquet (can read specific columns without loading all data)
table_back = pq.read_table("measurements.parquet", columns=["sensor_id", "value"])
df_subset = table_back.to_pandas()
# R: Parquet with arrow
library(arrow)
# Write
df <- data.frame(sensor_id = rep("s-01", 1000), value = rnorm(1000))
arrow::write_parquet(df, "measurements.parquet")
# Read (with column selection — only reads selected columns from disk)
df_back <- arrow::read_parquet("measurements.parquet", col_select = c("value"))
→ Parquet 5-20x smaller than CSV for tabular.
If err: Arrow unavail → fastparquet (Py)|CSV+gzip fallback.
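
A CSV+gzip fallback could look like this, using only the standard library. It is a sketch under assumptions: the rows are invented, the file goes to a temporary directory, and the read shows what is lost compared with Parquet (no column pruning, no types).

```python
import csv
import gzip
import os
import tempfile

rows = [
    {"sensor_id": f"s-{i % 3 + 1:02d}", "value": round(20 + i * 0.01, 2)}
    for i in range(1000)
]
path = os.path.join(tempfile.mkdtemp(), "measurements.csv.gz")

# Write: gzip.open in text mode lets csv write straight into the compressed stream
with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sensor_id", "value"])
    writer.writeheader()
    writer.writerows(rows)

# Read: CSV is row-oriented, so every row is decompressed even for one column,
# and values come back as strings that need explicit conversion
with gzip.open(path, "rt", newline="", encoding="utf-8") as fh:
    values = [float(rec["value"]) for rec in csv.DictReader(fh)]

print(len(values), values[:3])
```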
Step 6: Compare Perf
import json
import time
import msgpack
data = [{"id": i, "value": i * 0.1, "label": f"item-{i}"} for i in range(10000)]
# JSON
start = time.perf_counter()
json_bytes = json.dumps(data).encode()
json_time = time.perf_counter() - start
# MessagePack
start = time.perf_counter()
msgpack_bytes = msgpack.packb(data)
msgpack_time = time.perf_counter() - start
print(f"JSON: {len(json_bytes):>8} bytes, {json_time*1000:.1f} ms")
print(f"MsgPack: {len(msgpack_bytes):>8} bytes, {msgpack_time*1000:.1f} ms")
→ Benchmarks guide format for prod.
If err: insufficient perf any format → consider compression (zstd, snappy) as orthogonal optimization.
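
Compression as a separate layer can be measured the same way. The sketch below uses `zlib` from the standard library as a stand-in for zstd or snappy (an assumption made so it runs without extra packages), and `best_ms` is a hypothetical helper that takes the best of several runs to reduce timing noise.

```python
import json
import timeit
import zlib

data = [{"id": i, "value": i * 0.1, "label": f"item-{i}"} for i in range(10000)]
json_bytes = json.dumps(data).encode()


def best_ms(fn, repeat=5):
    """Best-of-N wall time in milliseconds; the minimum is least affected by noise."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1000


print(f"raw JSON:     {len(json_bytes):>8} bytes")
for level in (1, 6, 9):
    compressed = zlib.compress(json_bytes, level)
    ms = best_ms(lambda: zlib.compress(json_bytes, level))
    print(f"zlib level {level}: {len(compressed):>8} bytes, {ms:.1f} ms")

# Compression must be lossless
assert zlib.decompress(compressed) == json_bytes
```

Higher levels trade CPU time for size, so the level is worth benchmarking on representative payloads alongside the format itself.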
Check
- Format matches use case (rationale documented)
- Round-trip preserves all types
- Edge cases: empty, null, Unicode, large nums
- Perf benchmarked for representative sizes
- Err handling for malformed (graceful fail)
- Schema documented (JSON Schema|.proto|equiv)
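
The round-trip, edge-case, and malformed-input checks above can be sketched as a small script. It uses JSON from the standard library; `round_trips` and `load_or_error` are hypothetical helper names, and the same cases would need repeating for whichever format is actually chosen.

```python
import json

EDGE_CASES = [
    {},                                  # empty object
    [],                                  # empty array
    {"value": None},                     # null
    {"label": "température 日本語"},      # non-ASCII text
    {"big": 2**63 + 1},                  # beyond 64-bit signed range; other parsers may round
    {"nested": {"items": [1, 2.5, "x", None, True]}},
]


def round_trips(obj) -> bool:
    """True if encoding then decoding returns an equal object."""
    return json.loads(json.dumps(obj, ensure_ascii=False)) == obj


def load_or_error(text: str):
    """Return (value, None) on success or (None, message) on malformed input."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"malformed JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"


for case in EDGE_CASES:
    assert round_trips(case), case

value, error = load_or_error('{"sensor_id": "sensor-01", "value": }')
print(value, error)
```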
Traps
- Float precision: Most JSON parsers decode numbers as IEEE 754 doubles. Use string encoding for financial values.
- Date/time: No native JSON datetime. Always document format (ISO 8601) + timezone.
- Schema evolution: Add|remove fields can break consumers. Protobuf tolerates it if field numbers are never reused; JSON needs careful versioning.
- Binary in JSON: Base64 inflates ~33%. Binary format for binary-heavy.
- YAML security: Parsers may exec arbitrary code via `!!python/object` tags. Always use safe loaders.
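
Several of these traps can be reproduced with the standard library. The snippet is a sketch with made-up values; it covers the float, large-number, date/time, and Base64 items and leaves YAML out because that needs a third-party parser.

```python
import base64
import json
from datetime import datetime, timezone
from decimal import Decimal

# Float precision: binary doubles cannot represent most decimal fractions exactly
print(0.1 + 0.2 == 0.3)  # False
price = Decimal("19.99")
payload = json.dumps({"price": str(price)})  # money travels as a string
assert Decimal(json.loads(payload)["price"]) == price

# Large integers: doubles hold integers exactly only up to 2**53
assert float(2**53 + 1) == float(2**53)

# Date/time: emit ISO 8601 with an explicit UTC offset
stamp = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc).isoformat()
print(stamp)
assert datetime.fromisoformat(stamp).tzinfo is not None

# Binary in JSON: Base64 emits 4 output bytes for every 3 input bytes
blob = bytes(300)
encoded = base64.b64encode(blob)
print(len(blob), "->", len(encoded))
```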
→
- design-serialization-schema: schema design, versioning, evolution
- implement-pharma-serialisation: pharma serialisation (diff domain, same naming)
- create-quarto-report: data output for reports
