merge-duplicate-companies

Name: merge-duplicate-companies
Author: TomGranot

Updated 1 month ago

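About

This skill identifies duplicate company records in HubSpot by domain and by name, then exports audit CSV files that guide the merge process. It uses the HubSpot API to detect duplicates, but because HubSpot has no bulk merge API, the actual merging has to be done through third-party tools or manual merges in the UI. Use it for database cleanup work that consolidates scattered contact and deal data.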

Documentation

Merge Duplicate Companies

Purpose

Duplicate company records fragment contacts, deals, and engagement history across multiple records for the same real-world company. This leads to inaccurate reporting, broken associations, sales confusion, and workflow failures. This skill identifies duplicates by domain and by name, exports prioritized audit CSVs, and guides the user through merging.

Prerequisites

A HubSpot private app access token with crm.objects.companies.read scope
Python 3.10+ with uv for package management
A .env file containing HUBSPOT_ACCESS_TOKEN
Super Admin permissions for merging in the HubSpot UI

Key Constraint

HubSpot has no bulk merge API. Merging must happen one pair at a time through the HubSpot UI or via third-party tools. The API is used for discovery, analysis, and audit trail generation.

HubSpot's built-in Duplicates tool is NOT available on all plan tiers. Check whether the account has access to Settings > Data Management > Duplicates before relying on it.

Execution Pattern

This skill follows a 4-stage execution pattern: Plan -> Before State -> Execute -> After State.

Stage 1: Plan

Before writing any code, confirm with the user:

Confirm intentional duplicates: Ask whether separate records for regional offices of the same company are intentional. If so, exclude those from merging.
Merging is irreversible. Once two company records are merged, they cannot be un-merged. The surviving record inherits all associations, but property values from the deleted record may be lost if both have the same property filled in.
Prioritization strategy: Recommend merging Customer-stage companies first, then Opportunity-stage, then everything else.
Time estimate: This is the most time-consuming process. Budget 2-4 hours for critical duplicates, 8-12 hours total for full cleanup.

Stage 2: Before State

Fetch all companies, identify duplicate groups by domain and name, and export audit CSVs.

"""
Before State: Identify duplicate companies by domain and by name.
Creates CSV audit logs for review before merging.
"""
import os
import csv
import time
import requests
from collections import defaultdict
from dotenv import load_dotenv

load_dotenv()

TOKEN = os.environ["HUBSPOT_ACCESS_TOKEN"]
BASE = "https://api.hubapi.com"
headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

# --- Step 1: Fetch all companies ---
print("Fetching all companies...")

all_companies = []
after = None

while True:
    params = {
        "limit": 100,
        "properties": "name,domain,lifecyclestage,num_associated_contacts,"
                       "num_associated_deals,hubspot_owner_id,createdate",
    }
    if after:
        params["after"] = after

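    resp = requests.get(
        f"{BASE}/crm/v3/objects/companies",
        headers=headers, params=params, timeout=30,
    )
    if resp.status_code != 200:
        # Abort instead of analyzing a partial export, which would
        # under-report duplicates in the audit CSVs.
        raise SystemExit(
            f"Stopped at {len(all_companies)} companies "
            f"(status {resp.status_code}): {resp.text[:200]}"
        )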

    data = resp.json()
    for company in data.get("results", []):
        props = company.get("properties", {})
        all_companies.append({
            "id": company["id"],
            "name": (props.get("name") or "").strip(),
            "domain": (props.get("domain") or "").strip().lower(),
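            # Unset properties can come back as None, so fall back with `or`
            # (a .get() default only applies when the key is missing).
            "lifecycle_stage": props.get("lifecyclestage") or "",
            "associated_contacts": props.get("num_associated_contacts") or "0",
            "associated_deals": props.get("num_associated_deals") or "0",
            "owner_id": props.get("hubspot_owner_id") or "",
            "createdate": props.get("createdate") or "",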
        })

    paging = data.get("paging", {})
    after = paging.get("next", {}).get("after")
    if not after:
        break
    time.sleep(0.05)

print(f"Total companies fetched: {len(all_companies)}")

# --- Step 2: Find duplicates by domain ---
print("\nAnalyzing duplicates by domain...")

domain_groups = defaultdict(list)
for c in all_companies:
    if c["domain"]:
        domain_groups[c["domain"]].append(c)

dup_domain_groups = {d: cs for d, cs in domain_groups.items() if len(cs) > 1}
dup_domain_records = sum(len(cs) for cs in dup_domain_groups.values())

print(f"Unique domains with duplicates: {len(dup_domain_groups)}")
print(f"Total records in duplicate domain groups: {dup_domain_records}")

# Top offenders
sorted_domains = sorted(dup_domain_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate domains:")
for domain, companies in sorted_domains[:15]:
    print(f"  {domain}: {len(companies)} records")

# --- Step 3: Find duplicates by name ---
print("\nAnalyzing duplicates by name...")

name_groups = defaultdict(list)
for c in all_companies:
    if c["name"]:
        name_groups[c["name"].lower()].append(c)

dup_name_groups = {n: cs for n, cs in name_groups.items() if len(cs) > 1}
dup_name_records = sum(len(cs) for cs in dup_name_groups.values())

print(f"Unique names with duplicates: {len(dup_name_groups)}")
print(f"Total records in duplicate name groups: {dup_name_records}")

sorted_names = sorted(dup_name_groups.items(), key=lambda x: len(x[1]), reverse=True)
print("\nTop duplicate names:")
for name_lower, companies in sorted_names[:15]:
    print(f"  {companies[0]['name']}: {len(companies)} records")

# --- Step 4: Save CSV audit logs ---
os.makedirs("data/audit-logs", exist_ok=True)

# Domain duplicates CSV
domain_csv = "data/audit-logs/duplicate-companies-by-domain.csv"
with open(domain_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "domain", "duplicate_count", "id", "name", "lifecycle_stage",
        "associated_contacts", "associated_deals", "owner_id", "createdate",
    ])
    writer.writeheader()
    for domain, companies in sorted_domains:
        for c in companies:
            writer.writerow({
                "domain": domain,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "lifecycle_stage", "associated_contacts",
                    "associated_deals", "owner_id", "createdate",
                ]},
            })

print(f"\nDomain duplicates CSV: {domain_csv}")

# Name duplicates CSV
name_csv = "data/audit-logs/duplicate-companies-by-name.csv"
with open(name_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "duplicate_name", "duplicate_count", "id", "name", "domain",
        "lifecycle_stage", "associated_contacts", "associated_deals",
        "owner_id", "createdate",
    ])
    writer.writeheader()
    for name_lower, companies in sorted_names:
        for c in companies:
            writer.writerow({
                "duplicate_name": name_lower,
                "duplicate_count": len(companies),
                **{k: c[k] for k in [
                    "id", "name", "domain", "lifecycle_stage",
                    "associated_contacts", "associated_deals",
                    "owner_id", "createdate",
                ]},
            })

print(f"Name duplicates CSV: {name_csv}")

Present findings to the user. Key data points:

Total duplicate domain groups and affected records
Total duplicate name groups and affected records
Top offenders by domain and name
CSVs for manual review
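HubSpot answers with HTTP 429 when a rate limit is hit, and a full export over many pages can run into that. The following is a minimal sketch of a retry wrapper with exponential backoff, not part of the original script. The names `get_with_retry` and `fetch` are hypothetical, and `fetch` is assumed to be a zero-argument callable that returns a `(status_code, parsed_json)` tuple.

```python
import time
from typing import Callable, Tuple


def get_with_retry(
    fetch: Callable[[], Tuple[int, dict]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until it returns 200, backing off on 429 and 5xx.

    `fetch` is a hypothetical callable returning (status_code, parsed_json).
    Any other status, or running out of attempts, raises RuntimeError so the
    caller never continues with a partial export.
    """
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts - 1:
            raise RuntimeError(f"Request failed with status {status}")
        # Exponential backoff: base_delay, 2x, 4x, ... between attempts.
        sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

In the fetch loop, `fetch` could be a small function that wraps the `requests.get` call for the current page and returns `(resp.status_code, resp.json())`. The `sleep` parameter exists only so the wait can be replaced in tests.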

Stage 3: Execute

This stage is primarily manual. Guide the user through the merging process.

Option A: HubSpot Built-In Duplicates Tool (if available)

Navigate to Settings > Data Management > Duplicates > Companies
HubSpot shows suggested duplicate pairs ranked by confidence
For each pair, click Review to see side-by-side comparison
Select the "primary" (surviving) record based on:
- More associated contacts
- More associated deals
- More recent activity
- Has a company owner
- More complete property data
Click Merge
Process ~50 pairs at a time; HubSpot loads the next batch automatically
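The selection criteria above can be sketched as a scoring function over the record dictionaries built in the Before State script. This is a minimal sketch under assumptions: `to_int` and `suggest_primary` are hypothetical helper names, the order in which the criteria are compared is an assumption, and "more recent activity" is left out because the script does not fetch an activity property.

```python
from typing import Dict, List, Optional


def to_int(value: Optional[str]) -> int:
    """Parse HubSpot's string counts; treat None, blanks, and junk as 0."""
    try:
        return int(float(value or 0))
    except (TypeError, ValueError):
        return 0


def suggest_primary(group: List[Dict[str, Optional[str]]]) -> Dict[str, Optional[str]]:
    """Return the record that looks like the best merge survivor.

    Assumed ranking: more associated contacts, then more associated deals,
    then having an owner, then more filled-in fields. Ties keep the first
    record in the group.
    """
    def score(company: Dict[str, Optional[str]]):
        filled = sum(1 for v in company.values() if v not in (None, ""))
        return (
            to_int(company.get("associated_contacts")),
            to_int(company.get("associated_deals")),
            1 if company.get("owner_id") else 0,
            filled,
        )

    return max(group, key=score)
```

Treat the result as a suggestion to confirm in the side-by-side review, not as a decision. The merge itself remains a manual step.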

Prioritization order:

Customer-stage company duplicates (highest value data)
Opportunity-stage company duplicates
Everything else (Leads, Subscribers)
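One way to apply this order to the duplicate groups from the Before State script is to sort them by the best lifecycle stage found in each group. This is a sketch, assuming the account uses HubSpot's default internal stage values `customer` and `opportunity`; `STAGE_PRIORITY`, `group_priority`, and `prioritize` are hypothetical names.

```python
from typing import Dict, List, Tuple

# Lower number = merge earlier. Any other stage (including custom stages
# and blanks) falls into the "everything else" bucket.
STAGE_PRIORITY = {"customer": 0, "opportunity": 1}
OTHER = 2


def group_priority(companies: List[dict]) -> int:
    """Priority of a group is the best (lowest) priority among its records."""
    return min(
        STAGE_PRIORITY.get((c.get("lifecycle_stage") or "").lower(), OTHER)
        for c in companies
    )


def prioritize(groups: Dict[str, List[dict]]) -> List[Tuple[str, List[dict]]]:
    """Order groups by stage priority, then by group size (largest first)."""
    return sorted(
        groups.items(),
        key=lambda kv: (group_priority(kv[1]), -len(kv[1])),
    )
```

Calling `prioritize(dup_domain_groups)` would put any group containing a Customer-stage record ahead of Opportunity-stage groups, with larger groups first inside each bucket.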

Option B: Manual search-and-merge for top offenders

For companies with many duplicates (4+ records):

Search for the company by name in Contacts > Companies
Identify the "winner" record (most associations, deals, activity)
Open the winner record > Actions > Merge
Search for the duplicate > select it > choose property values > Merge
Repeat until only one record remains

Option C: Third-party deduplication tools

For large-scale merging, recommend:

Dedupely (dedupely.com) -- HubSpot-native integration, bulk merge
Insycle (insycle.com) -- Data management platform with dedup
Koalify (koalify.com) -- HubSpot duplicate management

These tools can automate bulk merges that would take hours manually.

Prevention: Configure auto-association after merging

Settings > Data Management > Companies (or Settings > Objects > Companies)
Enable: "Create and associate companies with contacts"
Set unique identifier: Company domain name

This prevents future duplicates by using domain-based matching instead of name-based.

Stage 4: After State

Re-run the Before State analysis and compare duplicate counts.

"""
After State: Verify duplicate reduction.
"""
# Re-fetch all companies and re-run duplicate analysis
# Compare:
#   - Number of duplicate domain groups (should decrease)
#   - Number of duplicate name groups (should decrease)
#   - Top offenders (should be resolved)

# Also verify merged records:
# For each known duplicate that was merged, search for the company
# and confirm only one record exists with all expected associations.
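The comparison described above can be sketched by summarizing the audit CSVs from both runs. This assumes the Before State CSVs were copied aside before re-running the analysis, since the script writes to fixed filenames and would otherwise overwrite them. `summarize` and `report` are hypothetical helper names.

```python
import csv
from typing import Tuple


def summarize(csv_path: str, key_column: str) -> Tuple[int, int]:
    """Return (duplicate_groups, records) for one audit CSV.

    key_column is "domain" for the domain CSV and "duplicate_name" for the
    name CSV. errors="replace" keeps the count working even if the file was
    written with a different encoding.
    """
    keys = set()
    rows = 0
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as f:
        for row in csv.DictReader(f):
            keys.add(row[key_column])
            rows += 1
    return len(keys), rows


def report(label: str, before: Tuple[int, int], after: Tuple[int, int]) -> str:
    """Format a one-line before/after comparison."""
    b_groups, b_rows = before
    a_groups, a_rows = after
    return (
        f"{label}: groups {b_groups} -> {a_groups}, "
        f"records {b_rows} -> {a_rows}"
    )
```

Both numbers should go down after merging. A group count that stays flat usually means the merged pairs belonged to groups with three or more records.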

Manual verification:

Search for top offenders by name (should show only 1 record each)
Open merged records and verify contacts and deals from both originals appear
Check Settings > Data Management > Duplicates -- count should be significantly lower

Safety Mechanisms

Mechanism	Detail
CSV audit trail	Export of every company that belongs to a duplicate group, annotated with its group key and group size, before any merging.
Prioritized approach	Customer and Opportunity companies merged first to protect highest-value data.
Review before merge	CSVs enable team review before any irreversible merges happen.
Confirmation prompt	Present duplicate analysis to the user and wait for explicit confirmation before instructing merges.
No auto-merge	This skill never merges automatically. All merges require manual human decision.

Technical Gotchas

HubSpot has no bulk merge API. The CRM API can merge only one pair of companies per request (POST /crm/v3/objects/companies/merge), and this skill does not call it. All merges happen through the UI or third-party tools.
Merging is irreversible. Once merged, records cannot be split apart. When in doubt, skip a pair and revisit later.
Property conflicts: When both records have a value for the same property, HubSpot keeps the value from the "primary" record. Review important properties (phone, address, industry) before confirming.
Companies endpoint uses GET, not POST/search. To list all companies, use GET /crm/v3/objects/companies with pagination, not the Search API. The Search API also works, but it caps each query at 10,000 results and has a tighter rate limit, so it is a poor fit for full exports.
Domain normalization: Always lowercase and strip whitespace from domains before grouping. Example.com and example.com are the same company.
Name-based duplicates have higher false-positive rates. "State University" might match multiple genuinely different institutions. Domain-based duplicates are more reliable.
Contact reassociation: After merging, verify that contacts from both original records appear under the surviving record. HubSpot should handle this automatically, but spot-check.
The Duplicates tool is plan-tier dependent. Not all HubSpot plans include it. Check availability before instructing the user to navigate there.
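The domain normalization and name false-positive points above can be combined into a small triage step. This is a sketch under an assumption that is not HubSpot behaviour: a name group whose members have two or more distinct non-empty domains is treated as a probable false positive and set aside for human review. `normalize_domain` and `split_name_groups` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def normalize_domain(raw: Optional[str]) -> str:
    """Lowercase and strip whitespace so Example.com equals example.com."""
    return (raw or "").strip().lower()


def split_name_groups(
    name_groups: Dict[str, List[dict]],
) -> Tuple[Dict[str, List[dict]], Dict[str, List[dict]]]:
    """Split name groups into (likely_duplicates, needs_review).

    Assumed heuristic: records that share a name but carry different
    domains are probably different organizations.
    """
    likely: Dict[str, List[dict]] = {}
    review: Dict[str, List[dict]] = {}
    for name, companies in name_groups.items():
        domains = {normalize_domain(c.get("domain")) for c in companies}
        domains.discard("")  # records without a domain say nothing either way
        target = review if len(domains) > 1 else likely
        target[name] = companies
    return likely, review
```

Applied to `dup_name_groups`, a group like "State University" with several different domains would land in the review bucket, while same-name records that share one domain stay in the likely-duplicate bucket.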

Package Setup

uv init hubspot-cleanup
cd hubspot-cleanup
uv add requests python-dotenv

Create a .env file:

HUBSPOT_ACCESS_TOKEN=pat-na1-xxxxxxxx

GitHub repository

TomGranot/hubspot-admin-skills

Path: skills/merge-duplicate-companies

Tags: hubspot, hubspot-api, hubspot-crm, hubspot-integration

FAQ

What is the merge-duplicate-companies skill?

merge-duplicate-companies is a Claude Skill by TomGranot. Skills package instructions and resources that Claude loads on demand, so Claude can perform merge-duplicate-companies-related tasks without extra prompting.

How do I install merge-duplicate-companies?

Use the install commands on this page: add merge-duplicate-companies to Claude Code as a plugin, or clone its repository into your skills directory, then restart Claude so it picks up the skill.

What category does merge-duplicate-companies belong to?

merge-duplicate-companies is in the Design category, tagged ai, api and design.

Is merge-duplicate-companies free to use?

Yes. merge-duplicate-companies is listed on AIMCP and free to install. It runs inside Claude, so no separate service account is required to use the skill itself.
