Fragmented ID Resolution: Using Deep Learning to Match Noisy Identity Records

This repository contains the Fragmented ID Resolution project, completed in partial fulfillment of The Erdős Institute Deep Learning Bootcamp, Spring 2026.

Team members: Noimot Bakare Ayoub, Pedro Fontanarrosa, Arpith Shanbhag, Dharineesh Somisetty

Presentation Deliverables

Contents

  1. Introduction
  2. Dataset
  3. Methods and Models
  4. Results
  5. Key Performance Indicators
  6. Challenges
  7. Deployment
  8. Files

Introduction

Identity data in real-world systems are noisy and fragmented. Small inconsistencies — typos, nicknames, surname changes, ZIP variations — can cause a single person to appear as multiple records, creating major risks across industries:

  • Financial Services: Vulnerability to synthetic identity fraud and account takeover.
  • Healthcare: Patient mismatches, duplicated charts, and potential clinical risk.
  • Government & Public Sector: Duplicate citizen records, complicating benefits, taxation, and verification.

Traditional matching methods break down in these high-noise environments, motivating the need for a more robust approach. This project proposes a deep learning framework for fragmented identity resolution: a Siamese neural network that learns an adaptive similarity function over identity record pairs, significantly outperforming both rule-based and embedding-only baselines — especially on hard cases where duplicates vary across multiple fields simultaneously.

Our best model achieves 99.4% F1 overall and 98.3% F1 on hard cases, with a +25 point improvement over the deep learning baseline on hard-case F1 (0.71 → 0.96).

Dataset

We utilize the Hasso Plattner Institute (HPI) North Carolina State Board of Election (NCSBE) voter registration dataset, a standard benchmark for duplicate detection research.

Dataset summary:

| Statistic | Count |
|---|---|
| Voter records | 14,183 |
| Duplicate pairs (labeled) | 9,891 |
| Non-duplicate pairs (labeled) | 98,142 |
| Class ratio (non-dup : dup) | ~10 : 1 |

Each voter record contains attributes including first name, last name, middle name, age, sex, race, ethnicity, house number, street name, street type, and ZIP code.

Data preparation:

  • Augmented data to reflect real-world variations (typos, nicknames, surname changes, ZIP variations)
  • Constructed hard negatives to capture confusable identities (different people who look similar)
  • Group-based entity-disjoint train / validation / test split to prevent data leakage during evaluation
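The entity-disjoint split above can be sketched in a few lines. The `entity_id` field and toy records below are illustrative stand-ins for the repo's actual schema, not its real code; grouped splitters such as scikit-learn's `GroupShuffleSplit` do the same job in production pipelines:

```python
# Minimal sketch of an entity-disjoint train/test split: every record that
# shares an entity_id lands on the same side of the boundary, so no
# identity leaks between train and test. Schema is illustrative.
import random

def entity_disjoint_split(records, test_frac=0.3, seed=0):
    """Split records so no entity_id appears in both train and test."""
    entities = sorted({r["entity_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(entities)
    n_test = max(1, int(len(entities) * test_frac))
    test_ids = set(entities[:n_test])
    train = [r for r in records if r["entity_id"] not in test_ids]
    test = [r for r in records if r["entity_id"] in test_ids]
    return train, test

records = [
    {"entity_id": 1, "first": "Liz", "last": "Miller-Davis"},
    {"entity_id": 1, "first": "Elisabeth", "last": "Millar"},
    {"entity_id": 2, "first": "John", "last": "Smith"},
    {"entity_id": 3, "first": "Ann", "last": "Lee"},
]
train, test = entity_disjoint_split(records)
# Both fragments of entity 1 end up on the same side of the split.
assert {r["entity_id"] for r in train}.isdisjoint({r["entity_id"] for r in test})
```

Splitting on entities rather than on pairs is what makes the evaluation honest: a random pair-level split would let the model see one fragment of a person at train time and another at test time.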

Methods and Models

Why Simple Thresholding Fails

  1. Duplicates vary across fields — the same pair can be "close" in one attribute and "far" in another (e.g., same address but different last name due to marriage)
  2. Similarity scores overlap — cosine similarity distributions for duplicates and non-duplicates overlap significantly, causing single-threshold approaches to misclassify borderline cases
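Both failure modes can be reproduced with a toy scorer. The names below are made up and `difflib`'s ratio stands in for any single-number string similarity; the point is only that a hard negative can out-score a true duplicate, so no global threshold separates them:

```python
# Toy illustration of why one global threshold fails: a true duplicate
# (surname change after marriage) and a hard negative (similar-looking
# strangers) receive comparable average similarity scores.
from difflib import SequenceMatcher

def field_sim(a, b):
    """Single-field string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def avg_sim(r1, r2):
    """Naive record similarity: unweighted mean over fields."""
    return sum(field_sim(r1[k], r2[k]) for k in r1) / len(r1)

duplicate = (  # same person: marriage changed the surname
    {"first": "Liz", "last": "Miller-Davis", "street": "Oak St"},
    {"first": "Elisabeth", "last": "Millar", "street": "Oak St"},
)
hard_negative = (  # different people with near-identical names
    {"first": "John", "last": "Smith", "street": "Oak St"},
    {"first": "Jon", "last": "Smyth", "street": "Elm St"},
)

# The hard negative scores at least as high as the true duplicate, so any
# single threshold misclassifies one of the two pairs.
assert avg_sim(*hard_negative) >= avg_sim(*duplicate)
```

This inversion is exactly what the learned, field-aware similarity function in Model 3 is designed to undo.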

Models

We developed and compared three model families:

| Model | Approach | Strengths | Limitations |
|---|---|---|---|
| Model 1: Logistic Regression + XGBoost | Manually engineered features (string distances, exact matches) | Handles typos, nicknames, and ZIP variance | Struggles with last-name differences |
| Model 2: CNN Embedding + Cosine Similarity | Converts records into embeddings, compares via cosine similarity | Captures some variations | Poor separation on hard cases |
| Model 3: Siamese Network + Learned Weights | Shared encoder learns both agreement and difference signals; adaptive similarity function | Resolves fragmentation across variations; distinguishes similar but different individuals | Requires more training data and compute |

Siamese Network Architecture

The Siamese network is the core contribution of this project:

  1. Both records are encoded through a shared network (BiLSTM or CharCNN encoder)
  2. The model learns a similarity function from how record embeddings agree and differ
  3. An adaptive similarity function improves hard-case detection by assigning higher weight to meaningful differences
  4. Training minimizes binary cross-entropy, L = -[y * log(s) + (1 - y) * log(1 - s)], where y is the duplicate label (1 = same person) and s is the predicted similarity

Key design choices:

  • Character-level encoding to handle typos and abbreviations
  • Absolute difference between pair vectors as input to the classifier head
  • Weighted loss to handle 10:1 class imbalance
  • Hard-example mining and weighting to focus learning on difficult cases
  • Entity-disjoint splits to prevent leakage
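The classifier head and loss described above can be sketched numerically. Everything below is a toy stand-in: hashed character-bigram counts replace the learned BiLSTM/CharCNN encoder, the head weights are arbitrary rather than trained, and only the |u - v| feature and the positively-weighted cross-entropy mirror the actual design:

```python
# Numeric sketch of the Siamese pipeline's head and loss: encode each
# record as a hashed character-bigram count vector (stand-in for the
# learned encoder), feed |u - v| to a logistic head, and up-weight
# positives ~10x to offset the 10:1 class imbalance.
import math

DIM = 64  # toy embedding dimension

def encode(text):
    """Hashed character-bigram counts -- a stand-in for the shared encoder."""
    v = [0.0] * DIM
    t = f"#{text.lower()}#"  # boundary markers
    for a, b in zip(t, t[1:]):
        v[hash(a + b) % DIM] += 1.0
    return v

def pair_features(r1, r2):
    """Absolute difference |u - v| between the two record encodings."""
    u, v = encode(r1), encode(r2)
    return [abs(x - y) for x, y in zip(u, v)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_bce(y_true, y_pred, pos_weight=10.0):
    """BCE with up-weighted positives: L = -w*y*log(s) - (1-y)*log(1-s)."""
    eps = 1e-9
    w = pos_weight if y_true == 1 else 1.0
    return -(w * y_true * math.log(y_pred + eps)
             + (1 - y_true) * math.log(1 - y_pred + eps))

# Untrained toy head: the score falls as the two encodings diverge.
x = pair_features("liz miller-davis", "elisabeth millar")
score = sigmoid(2.0 - 0.5 * sum(x))  # arbitrary weights, not learned ones
loss = weighted_bce(1, score)        # this pair is labeled as a duplicate
assert 0.0 < score < 1.0 and loss > 0.0
```

In the real model the head weights, like the encoder, are learned end-to-end, which is what lets the similarity function adapt per field instead of applying one global threshold.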

Hard-Example Mining

We mine hard positives (true duplicates that look very different) and hard negatives (different people who look very similar) directly from the labeled data. These hard examples are used as training weights to bias the model toward learning difficult cases without distorting the original label distribution.

Current hard-example counts:

  • Hard positives: 1,950
  • Hard negatives: 1,794
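The mining rule can be sketched as follows. The thresholds, the `difflib` baseline scorer, and the 3x weight are illustrative choices, not the repo's tuned values; the structure (low-similarity duplicates and high-similarity non-duplicates get extra weight, everything else keeps weight 1) is the part that matches the description above:

```python
# Sketch of hard-example mining: duplicates with LOW baseline similarity
# become hard positives, non-duplicates with HIGH similarity become hard
# negatives, and both are up-weighted during training. Labels stay intact,
# so the original label distribution is not distorted.
from difflib import SequenceMatcher

def baseline_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def mine_weights(pairs, lo=0.55, hi=0.7, hard_weight=3.0):
    """Return one sample weight per (a, b, is_dup) pair; hard cases get more."""
    weights = []
    for a, b, is_dup in pairs:
        s = baseline_sim(a, b)
        hard = (is_dup and s < lo) or (not is_dup and s > hi)
        weights.append(hard_weight if hard else 1.0)
    return weights

pairs = [
    ("liz miller-davis", "elisabeth millar", True),   # hard positive
    ("john smith", "john smith", True),               # easy positive
    ("jon smyth", "john smith", False),               # hard negative
    ("ann lee", "robert cruz", False),                # easy negative
]
weights = mine_weights(pairs)
assert weights[2] == 3.0 and weights[1] == 1.0  # hard negative vs easy positive
```

Because the hard examples enter only as weights, the class ratio the model sees is unchanged; its gradient is simply concentrated on the cases a cheap baseline gets wrong.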

Results

The Siamese model significantly improves performance on hard cases compared to both the rule-based and embedding-only baselines.

Table 1: Test-set performance across all models

| Model | Precision (Overall) | Precision (Hard) | Recall (Overall) | Recall (Hard) | F1 (Overall) | F1 (Hard) |
|---|---|---|---|---|---|---|
| XGBoost | .97 | .86 | .97 | .82 | .97 | .84 |
| Embedding + Cosine (DL Baseline) | .94 | .65 | .91 | .78 | .92 | .71 |
| Siamese + Weights | .99 | .93 | .99 | .99 | .99 | .96 |

Hard-case F1 improvement: +25 points (0.71 → 0.96) over the deep learning baseline.

Detailed Comparison on Blocked Extended-Attribute Setting

Table 2: Test-set metrics with blocking and extended attributes

| Model | F1 | PR-AUC | Hard-subset F1 | Hard-positive Recall | Hard-negative Rejection |
|---|---|---|---|---|---|
| TF-IDF baseline | 0.9630 | 0.9945 | 0.8793 | 0.8361 | 0.9130 |
| Siamese BiLSTM (initial) | 0.9849 | 0.9994 | 0.9677 | 0.9836 | 0.9348 |
| Siamese BiLSTM (tuned) | 0.9891 | 0.9994 | 0.9833 | 0.9672 | 1.0000 |

The tuned Siamese model achieves perfect rejection on mined hard negatives while maintaining high overall quality.

Deployment-Default Model Metrics

The final deployed model incorporates middle-name features and sex-aware pair features:

| Metric | Result |
|---|---|
| F1 | 0.9935 |
| PR-AUC | 0.9998 |
| Hard-subset F1 | 0.9833 |
| Easy-subset F1 | 0.9971 |
| Hard-positive Recall | 0.9672 |
| Hard-negative Rejection | 1.0000 |

Key Performance Indicators (KPIs)

Model Performance KPIs

Our primary KPIs focus on performance where it matters most — on hard cases that traditional methods miss:

| KPI | Target | Achieved |
|---|---|---|
| Overall F1 | > 0.95 | 0.9935 |
| Hard-case F1 | > 0.90 | 0.9833 |
| Hard-positive Recall | > 0.95 | 0.9672 |
| Hard-negative Rejection | > 0.95 | 1.0000 |
| PR-AUC | > 0.99 | 0.9998 |

Business KPIs

The blocking stage reduces candidate pair comparisons by ~87% while maintaining 100% positive pass-through:

| Metric | Result | Interpretation |
|---|---|---|
| Positive pass rate | 100% | No true duplicates missed by blocking |
| Negative pass rate | 3.5% | 96.5% of non-duplicate pairs eliminated early |
| Candidate reduction | 303 / 2,291 | ~87% fewer comparisons needed |
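A blocking stage of this kind can be sketched in a few lines. The key below (ZIP code plus the surname's first letter) is an illustrative choice, not the repo's actual blocking scheme; a production key must be chosen so that true duplicates still share a block, since any pair split across blocks is never scored:

```python
# Sketch of blocking: group records by a cheap key and compare only pairs
# inside the same block, instead of all O(n^2) pairs.
from collections import defaultdict
from itertools import combinations

def block_key(rec):
    """Illustrative blocking key: ZIP code + first letter of surname."""
    return (rec["zip"], rec["last"][:1].upper())

def candidate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r)
    for group in blocks.values():
        yield from combinations(group, 2)  # pairs within a block only

records = [
    {"last": "Miller-Davis", "zip": "27601"},
    {"last": "Millar",       "zip": "27601"},
    {"last": "Smith",        "zip": "27601"},
    {"last": "Lee",          "zip": "28202"},
]
pairs = list(candidate_pairs(records))
# 4 records -> 6 exhaustive pairs, but only the two M* records share a block.
assert len(pairs) == 1
```

The 100% positive pass rate reported above means the project's actual key never splits a labeled duplicate pair across blocks, which is the hard constraint any candidate key must satisfy before its reduction rate matters.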

Challenges

Developing a fragmented identity resolution model presents several challenges:

  1. Feature Variation Across Fields: Duplicates can match on some attributes (e.g., address) while differing on others (e.g., last name after marriage). No single similarity threshold captures all cases.

  2. Hard Negatives: Different individuals who share similar attributes (same first name, same street, close age) create confusable pairs that push false positive rates up.

  3. Class Imbalance: Non-duplicate pairs outnumber duplicates ~10:1, making it difficult for models to learn the minority class.

  4. Surname Expansion and Nickname Variation: Real-world name changes (e.g., "Liz Miller-Davis" vs. "Elisabeth Millar") require the model to look beyond character-level similarity.

  5. Entity-Disjoint Evaluation: Standard random splits can leak entity information across train/test. We enforce entity-disjoint splits to get honest generalization estimates.

Deployment

LinkID — Identity Matching Simplified

The project includes LinkID, a web application for production-style identity resolution:

  1. Upload voter registration data (CSV/TSV)
  2. Detect and rank likely duplicates using the deployed Siamese model
  3. Review high-confidence matches and borderline cases close to the threshold
  4. Export duplicate lists and human-review queues

Features:

  • Bulk duplicate scan and single-record clerk check modes
  • Disagreement sections labeled as "Review recommended" with exact matching/differing fields shown
  • Human-review queue with accept/reject/uncertain decisions and optional notes
  • SQLite-backed review persistence
  • Export buttons for duplicate CSV and human-review CSV

Quick start:

```bash
# Using Docker (recommended)
docker compose up --build

# Or manually
conda env create -f environment.yml
conda activate fragmented-id
uvicorn src.api:app --reload --host 0.0.0.0 --port 8000
```

  • Web app: http://127.0.0.1:8000/
  • API docs: http://127.0.0.1:8000/docs

See demo/README.md for detailed runtime and API usage notes.

Files

Core Source Code

Scripts

  • Not included in this demo-focused repository snapshot

Configuration and Setup

Documentation

Tests

```bash
cd demo && pytest -q
```

Coverage includes pair integrity checks, split correctness (entity-disjoint leakage), model I/O shape checks, metric correctness vs. scikit-learn, and end-to-end smoke tests.

About

LinkID: An identity resolution system combining deep learning and human review to detect duplicate records and support reliable decision-making.
