data-anonymizer

majiayu000

Updated 2 days ago

Communication, ai, data

About

This Claude skill detects and masks PII such as names, emails, and phone numbers in both text strings and CSV files. It offers multiple anonymization strategies, including reversible tokenization that keeps a mapping back to the original values. Use it to sanitize sensitive data in documents or structured datasets before using them for development or testing.

Quick Install

Claude Code

Plugin Command (Recommended)

/plugin add https://github.com/majiayu000/claude-skill-registry

Git Clone (Alternative)

git clone https://github.com/majiayu000/claude-skill-registry.git ~/.claude/skills/data-anonymizer

Documentation

Data Anonymizer

Detect and mask personally identifiable information (PII) in text documents and structured data. Supports multiple masking strategies and can process whole CSV files, either on selected columns or on columns where PII is auto-detected.

Quick Start

from scripts.data_anonymizer import DataAnonymizer

# Anonymize text
anonymizer = DataAnonymizer()
result = anonymizer.anonymize("Contact John Smith at [email protected] or 555-123-4567")
print(result)
# "Contact [NAME] at [EMAIL] or [PHONE]"

# Anonymize CSV
anonymizer.anonymize_csv("customers.csv", "customers_anon.csv")

Features

PII Detection: Names, emails, phone numbers, SSNs, addresses, credit cards, dates of birth, IP addresses
Multiple Strategies: Mask, redact, hash, fake data replacement
CSV Processing: Anonymize specific columns or auto-detect
Reversible Tokens: Optional mapping for de-anonymization
Custom Patterns: Add your own PII patterns
Audit Report: List all detected PII with locations

API Reference

Initialization

anonymizer = DataAnonymizer(
    strategy="mask",      # mask, redact, hash, fake
    reversible=False      # Enable token mapping
)

Text Anonymization

# Basic anonymization
result = anonymizer.anonymize(text)

# With specific PII types
result = anonymizer.anonymize(text, pii_types=["email", "phone"])

# Get detected PII report
result, report = anonymizer.anonymize(text, return_report=True)
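
The structure of the report is not documented here. As a rough sketch, assuming a report is a list of detections with a type, the matched value, and character offsets, regex-based masking with a report could look like the following. The `mask_with_report` helper and the simplified patterns are illustrative assumptions, not the skill's actual implementation.

```python
import re

# Simplified, illustrative patterns; the skill's real patterns are not shown here.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def mask_with_report(text, pii_types=None):
    """Return (masked_text, report); the report lists each match with its offsets."""
    selected = pii_types or list(PATTERNS)
    report = []
    for pii_type in selected:
        for match in PATTERNS[pii_type].finditer(text):
            report.append({
                "type": pii_type,
                "value": match.group(),
                "start": match.start(),
                "end": match.end(),
            })
    # Replace from the end of the string so earlier offsets stay valid.
    masked = text
    for item in sorted(report, key=lambda r: r["start"], reverse=True):
        label = f"[{item['type'].upper()}]"
        masked = masked[:item["start"]] + label + masked[item["end"]:]
    report.sort(key=lambda r: r["start"])
    return masked, report


masked, report = mask_with_report("Mail jane@example.com or call 555-123-4567")
print(masked)
for item in report:
    print(item["type"], item["start"], item["end"])
```

Offsets in this sketch refer to positions in the original text, which is what an audit trail usually needs.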

Masking Strategies

text = "Email [email protected], call 555-1234"

# Mask (default) - replace with type labels
anonymizer.strategy = "mask"
print(anonymizer.anonymize(text))
# "Email [EMAIL], call [PHONE]"

# Redact - replace each character with an asterisk
anonymizer.strategy = "redact"
print(anonymizer.anonymize(text))
# "Email ***************, call ********"

# Hash - replace with a short hash of the value
anonymizer.strategy = "hash"
print(anonymizer.anonymize(text))
# "Email a1b2c3d4, call e5f6g7h8"

# Fake - replace with realistic fake data
anonymizer.strategy = "fake"
print(anonymizer.anonymize(text))
# "Email [email protected], call 555-9876"
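
To make the difference between the four strategies concrete, here is a minimal sketch that applies each one to a detected email. The `apply_strategy` and `anonymize_emails` helpers are hypothetical. The hash variant assumes a truncated SHA-256 digest, since the document does not say which hash or length the skill uses, and the fake variant returns a fixed placeholder instead of calling faker.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def apply_strategy(value, label, strategy):
    """Return the replacement for one detected value under the given strategy."""
    if strategy == "mask":
        return f"[{label}]"
    if strategy == "redact":
        return "*" * len(value)
    if strategy == "hash":
        # Truncated SHA-256 hex digest; digest choice and length are assumptions.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    if strategy == "fake":
        # Fixed stand-in for a generated value (the skill lists faker as a dependency).
        return "user@example.org"
    raise ValueError(f"unknown strategy: {strategy}")


def anonymize_emails(text, strategy):
    return EMAIL.sub(lambda m: apply_strategy(m.group(), "EMAIL", strategy), text)


sample = "Email jane@example.com today"
for name in ("mask", "redact", "hash", "fake"):
    print(name, "->", anonymize_emails(sample, name))
```

Note that redaction as sketched here preserves the length of the original value, which can itself leak a little information.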

CSV Processing

# Auto-detect PII columns
anonymizer.anonymize_csv("input.csv", "output.csv")

# Specify columns
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    columns=["name", "email", "phone"]
)

# Different strategies per column
anonymizer.anonymize_csv(
    "input.csv",
    "output.csv",
    column_strategies={
        "name": "fake",
        "email": "hash",
        "ssn": "redact"
    }
)
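
Per-column strategies amount to a lookup from column name to a transformation. The sketch below illustrates that idea with the standard-library `csv` module on in-memory text rather than pandas, so it runs without extra dependencies. `anonymize_csv_text`, `STRATEGIES`, and the two helper functions are hypothetical names, not part of the skill.

```python
import csv
import hashlib
import io


def redact(value):
    return "*" * len(value)


def short_hash(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical strategy table; the names mirror two of the strategies listed above.
STRATEGIES = {"redact": redact, "hash": short_hash}


def anonymize_csv_text(csv_text, column_strategies):
    """Apply one strategy per named column and return the new CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column, strategy in column_strategies.items():
            if column in row:
                row[column] = STRATEGIES[strategy](row[column])
        writer.writerow(row)
    return out.getvalue()


source = "id,email,ssn\n1,jane@example.com,123-45-6789\n"
print(anonymize_csv_text(source, {"email": "hash", "ssn": "redact"}))
```

Columns that are not named in the mapping pass through unchanged, which matches the "specify columns" behavior described above.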

Reversible Anonymization

anonymizer = DataAnonymizer(reversible=True)

# Anonymize with token mapping
result = anonymizer.anonymize("John Smith: [email protected]")
mapping = anonymizer.get_mapping()

# Save mapping securely
anonymizer.save_mapping("mapping.json", encrypt=True, password="secret")

# Later, de-anonymize
anonymizer.load_mapping("mapping.json", password="secret")
original = anonymizer.deanonymize(result)
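
Reversible anonymization boils down to keeping a two-way mapping between original values and opaque tokens. The following is a minimal sketch of that idea for emails only. The `TokenVault` class and the `[EMAIL_n]` token format are assumptions for illustration, and encryption of the stored mapping is left out.

```python
import itertools
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class TokenVault:
    """Hypothetical helper: swaps values for numbered tokens and back."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, text):
        def replace(match):
            value = match.group()
            # Reuse the token when the same value appears again.
            if value not in self._value_to_token:
                token = f"[EMAIL_{next(self._counter)}]"
                self._value_to_token[value] = token
                self.token_to_value[token] = value
            return self._value_to_token[value]

        return EMAIL.sub(replace, text)

    def detokenize(self, text):
        for token, value in self.token_to_value.items():
            text = text.replace(token, value)
        return text


vault = TokenVault()
original = "From jane@example.com to bob@example.com, cc jane@example.com"
tokenized = vault.tokenize(original)
print(tokenized)
print(vault.detokenize(tokenized) == original)
```

Anyone holding the mapping can undo the anonymization, so the mapping needs the same protection as the original data.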

Custom Patterns

# Add custom PII pattern
anonymizer.add_pattern(
    name="employee_id",
    pattern=r"EMP-\d{6}",
    label="[EMPLOYEE_ID]"
)
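
A custom pattern is essentially a named regex paired with a replacement label. The sketch below shows one way such a registry could work. The module-level `custom_patterns` dictionary and the `apply_custom_patterns` function are hypothetical, and the word boundaries (`\b`) are an addition so that longer digit runs are not partially matched.

```python
import re

# Hypothetical registry: pattern name -> (compiled regex, replacement label).
custom_patterns = {}


def add_pattern(name, pattern, label):
    custom_patterns[name] = (re.compile(pattern), label)


def apply_custom_patterns(text):
    for regex, label in custom_patterns.values():
        # A function replacement avoids backslash processing in the label.
        text = regex.sub(lambda _match, label=label: label, text)
    return text


add_pattern("employee_id", r"\bEMP-\d{6}\b", "[EMPLOYEE_ID]")
print(apply_custom_patterns("Badge EMP-004217 was issued; EMP-12 is not a valid ID."))
```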

CLI Usage

# Anonymize text file
python data_anonymizer.py --input document.txt --output document_anon.txt

# Anonymize CSV
python data_anonymizer.py --input customers.csv --output customers_anon.csv

# Specific strategy
python data_anonymizer.py --input data.csv --output anon.csv --strategy fake

# Generate audit report
python data_anonymizer.py --input document.txt --output document_anon.txt --report audit.json

# Specific PII types only
python data_anonymizer.py --input doc.txt --output doc_anon.txt --types email phone ssn

CLI Arguments

Argument	Description	Default
`--input`	Input file	Required
`--output`	Output file	Required
`--strategy`	Masking strategy	mask
`--types`	PII types to detect	all
`--columns`	CSV columns to process	auto
`--report`	Generate audit report	-
`--reversible`	Enable token mapping	False
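
The argument table maps naturally onto an `argparse` parser. The sketch below is an assumption about how such a CLI could be declared, not the script's actual source. Treating `--report` as a file path and using `None` to stand for "all" and "auto" are choices made for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Anonymize PII in text or CSV files")
    parser.add_argument("--input", required=True, help="Input file")
    parser.add_argument("--output", required=True, help="Output file")
    parser.add_argument("--strategy", default="mask",
                        choices=["mask", "redact", "hash", "fake"],
                        help="Masking strategy")
    # None stands for "all types" and "auto-detect columns" in this sketch.
    parser.add_argument("--types", nargs="+", default=None, help="PII types to detect")
    parser.add_argument("--columns", nargs="+", default=None, help="CSV columns to process")
    parser.add_argument("--report", default=None, help="Path for the audit report")
    parser.add_argument("--reversible", action="store_true", help="Enable token mapping")
    return parser


args = build_parser().parse_args(
    ["--input", "doc.txt", "--output", "doc_anon.txt", "--types", "email", "phone"]
)
print(args.strategy, args.types, args.reversible)
```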

Supported PII Types

Type	Examples	Detection method
`name`	John Smith, Mary Johnson	NLP-based
`email`	[email protected]	Regex
`phone`	555-123-4567, (555) 123-4567	Regex
`ssn`	123-45-6789	Regex
`credit_card`	4111-1111-1111-1111	Regex + Luhn
`address`	123 Main St, City, ST 12345	NLP + Regex
`date_of_birth`	01/15/1990, January 15, 1990	Regex
`ip_address`	192.168.1.1	Regex
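
The "Regex + Luhn" entry for credit cards means a regex finds candidate digit groups and the Luhn checksum then filters out numbers that cannot be real card numbers. The checksum step can be sketched as follows; the function name `luhn_valid` is illustrative and not taken from the skill.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("4111-1111-1111-1111"))  # True
print(luhn_valid("4111-1111-1111-1112"))  # False
```

Running the checksum after the regex reduces false positives such as order numbers that happen to be 16 digits long.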

Examples

Anonymize Customer Support Logs

anonymizer = DataAnonymizer(strategy="mask")

log = """
Ticket #1234: Customer John Doe ([email protected]) called about
billing issue. SSN on file: 123-45-6789. Callback number: 555-867-5309.
Address: 123 Oak Street, Springfield, IL 62701.
"""

result = anonymizer.anonymize(log)
print(result)
# Ticket #1234: Customer [NAME] ([EMAIL]) called about
# billing issue. SSN on file: [SSN]. Callback number: [PHONE].
# Address: [ADDRESS].

GDPR Compliance for Database Export

anonymizer = DataAnonymizer(strategy="hash")

# Deterministic hashing: the same value always yields the same hash, so joins across files still work
anonymizer.anonymize_csv(
    "users.csv",
    "users_anon.csv",
    columns=["email", "name", "phone"]
)

anonymizer.anonymize_csv(
    "orders.csv",
    "orders_anon.csv",
    columns=["customer_email"]  # Same hash as users.email
)
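
The join only works if the same input always produces the same hash in both files. How the skill hashes values is not documented here. One common approach, sketched below as an assumption, is a keyed HMAC over a normalized value, since a plain unsalted hash of an email address can be reversed by hashing a list of known addresses. `SECRET_KEY`, `pseudonymize`, and the lowercase normalization are all illustrative.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a secrets store, not source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value):
    """Keyed, deterministic hash: equal inputs always give equal tokens."""
    normalized = value.strip().lower()
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


users = [{"email": "jane@example.com", "plan": "pro"}]
orders = [{"customer_email": "Jane@Example.com ", "total": 42}]

users_anon = [{**u, "email": pseudonymize(u["email"])} for u in users]
orders_anon = [{**o, "customer_email": pseudonymize(o["customer_email"])} for o in orders]

# The join key survives anonymization even though the raw email is gone.
print(users_anon[0]["email"] == orders_anon[0]["customer_email"])  # True
```

Normalizing before hashing matters here: without it, differences in case or stray whitespace would produce different tokens and silently break the join.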

Generate Test Data from Production

anonymizer = DataAnonymizer(strategy="fake")

# Replace real PII with realistic fake data
anonymizer.anonymize_csv(
    "production_data.csv",
    "test_data.csv"
)

# Test data has same structure but fake PII

Dependencies

pandas>=2.0.0
faker>=18.0.0

Limitations

Name detection may miss unusual names
Address detection works best for US formats
Custom patterns may be needed for domain-specific PII
Fake data replacement doesn't preserve exact format

GitHub Repository

majiayu000/claude-skill-registry

Path: skills/data-anonymizer
