Three of the hardest challenges I have faced in using AI for land title research have been **low-resolution scan quality**, **variation in index data**, and **(human) name matching**. This post shows a strategy for clustering (human) names so that the variations of a name observed in the index can be resolved to the same person. The variation in index data is a factor here as well. There are **100s** of counties and **1000s** of county clerks over **100s** of years, and each clerk had their own capacity for accuracy or inaccuracy, detail or lack of detail. The instruments themselves vary too: they may contain errors, misspellings, etc.
![[Blog/Assets/nm-oil-and-gas-prospecting-permit-bad-scan.png]]
**Figure 01: Example of a poor scan in land title records**
**Objective**
Implement an **automated system** that normalizes, clusters, and resolves name variations (e.g., maiden names, abbreviations, nicknames) in land and mineral rights records. This ensures **accurate title research** by **eliminating ambiguity** in historical documents and their respective indexes.
### Example
Below are the **name similarity matrices** for **Charles David Turner** and **Linda Rebecca Porter**, showing **how different name variations match up**.
![[name-similarity-cd-turner.png]]
![[name-simularity-linda-porter.png]]
**How to Interpret the Results**
1️⃣ **High Similarity Scores (0.85 - 1.0)**
• **Strong match** → Likely the same person.
• Example: "Becky Porter, Attorney" **(0.96 match)** with "Becky Porter".
• "Linda Turner Porter Johnson" **(0.96 match)** with "Linda Rebecca Porter" (captures name changes).
• "Rebecca Porter" **(0.86 match)** with "Linda Rebecca Porter".
2️⃣ **Medium Similarity Scores (0.70 - 0.85)**
• **Probable match, but needs verification**.
• Example: "David Charles Turner" **(0.78 match)** with "Charles David Turner".
3️⃣ **Low Similarity Scores (< 0.70)**
• **Possible, but not strong enough for automatic linking**.
• Example: "Becky Turner-Porter" **(0.65 match)** with "Linda Rebecca Porter".
• "Linda P. Johnson" **(0.64 match)** with "Linda Rebecca Porter".
**How This Works in an AI System**
• **Threshold-Based Matching**
• If **similarity ≥ 0.90**, automatically **link the names**.
• If **0.75 ≤ similarity < 0.90**, **flag for manual review**.
• If **similarity < 0.75**, treat as a **different person**.
• **Graph Database for Name Tracking**
• "Linda Turner Porter Johnson" could be stored in **Neo4j** as an alias for "Linda Rebecca Porter".
• **Vector Search for Instant Lookups**
• If a title record mentions "Becky Porter", the system **automatically searches for** "Rebecca Porter" using vector embeddings.
### **1. Overview: How the System Works**
We use **state-of-the-art AI techniques** to:
✅ **Extract and standardize names** from documents
✅ **Cluster similar names together** using vector embeddings
✅ **Resolve ambiguities** (e.g., "Linda Porter" vs. "Becky Porter")
✅ **Track historical name changes** (e.g., maiden names, legal name changes)
✅ **Store results efficiently for fast lookup**
### **2. Data Ingestion & Name Standardization**
Before we apply AI, we need **consistent formatting**.
**Techniques Used:**
• **Text Cleaning** (Lowercasing, whitespace trimming, punctuation removal)
• **Phonetic Encoding** (Soundex, Metaphone, Double Metaphone)
• **Lexical Matching** (Handling initials and reordered names)
• **Fuzzy Matching** (Levenshtein Distance for typo correction)
**Example Python Code**
``` python
import re
import unicodedata

def normalize_name(name):
    """Standardizes names by removing accents, punctuation, and extra spaces."""
    name = name.lower().strip()
    # Decompose accented characters, then drop the non-ASCII combining marks
    name = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore').decode()
    name = re.sub(r'[^\w\s]', '', name)  # Remove punctuation
    name = re.sub(r'\s+', ' ', name).strip()  # Collapse runs of whitespace
    return name

print(normalize_name("Linda R. Porter, Esq."))  # Output: "linda r porter esq"
```
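The phonetic and fuzzy techniques listed above can be sketched in plain Python. This is a minimal illustration, not the code behind the figures in this post: `levenshtein` and `soundex` are hypothetical helper names, and the Soundex variant shown is the common American one. In practice a library would likely replace both.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Letter groups that share a Soundex digit
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    prev = _SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (word[0].upper() + "".join(digits) + "000")[:4]


# The index typo "ROBECCA" is one edit away from "REBECCA"
print(levenshtein("rebecca", "robecca"))
# Misspelled surnames can still share a phonetic code
print(soundex("Porter"), soundex("Portor"))
```

Edit distance catches clerk typos, while the phonetic code groups spellings that sound alike; neither handles nicknames such as "Becky" for "Rebecca", which is where the embedding approach in the next section comes in.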
### **3. Name Similarity Using AI Embeddings**
Instead of manually defining rules for **nicknames, abbreviations, and initials**, we use **pre-trained NLP models** to generate **name embeddings**.
**State-of-the-Art Models**
| **Model**                    | **Strengths**                                              | **Best For**                   |
| ---------------------------- | ---------------------------------------------------------- | ------------------------------ |
| all-mpnet-base-v2            | General-purpose sentence similarity                        | Legal names, common variations |
| nomic-ai/nomic-embed-text-v1 | General-purpose text embeddings with long-context support  | Complex name disambiguation    |
| microsoft/deberta-v3-base    | Context-aware encoder; needs pooling or fine-tuning to produce sentence embeddings | Tracking names over time       |
**Python Code to Generate Name Embeddings**
``` python
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load an embedding model. nomic-embed-text-v1 ships custom model code,
# so sentence-transformers needs trust_remote_code=True to load it.
# Its model card also recommends a task prefix (for example "clustering: ");
# it is omitted here to keep the raw names visible.
model = SentenceTransformer('nomic-ai/nomic-embed-text-v1', trust_remote_code=True)

# Example names (real-world variations)
names = [
    "Linda Rebecca Porter", "Linda Porter", "Becky Porter",
    "Becky Porter, Attorney", "Linda Turner Porter Johnson",
    "Linda R. Porter", "Rebecca Porter", "L. R. Porter",
    "Linda P. Johnson", "Becky Turner-Porter"
]

# Encode names into high-dimensional vector representations
name_embeddings = model.encode(names, convert_to_tensor=True)

# Compute pairwise cosine similarity scores
similarity_matrix = util.cos_sim(name_embeddings, name_embeddings)

# Display results as a labeled matrix
df = pd.DataFrame(similarity_matrix.cpu().numpy(), index=names, columns=names)
print(df)
```
**🔹 What This Does:**
• Generates **vector representations** for each name.
• Computes **cosine similarity** between names.
• The closer the score is to **1.0**, the **more similar** the names are.
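To make the cosine similarity step concrete, here is a small NumPy sketch of the same computation on toy 2-D vectors. The vectors are made up for illustration and are not real name embeddings; `cosine_similarity_matrix` is a hypothetical helper name. The score for two vectors is their dot product divided by the product of their lengths.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


# Toy vectors (assumed values, for illustration only)
toy_vectors = np.array([
    [1.0, 0.0],   # points along the x-axis
    [2.0, 0.0],   # same direction, different length
    [0.0, 3.0],   # perpendicular to the first two
    [1.0, 1.0],   # halfway between the axes
])

sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 3))
```

Rows 0 and 1 point the same way, so their similarity is 1.0 regardless of length; rows 0 and 2 are perpendicular, so their similarity is 0.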
### **4. Clustering Names Using AI**
After computing similarity scores, we **group similar names together**.
**Best Clustering Techniques**
|**Algorithm**|**Strengths**|**Use Case**|
|---|---|---|
|**DBSCAN**|No need to predefine # of clusters|Grouping similar names|
|**Agglomerative Clustering**|Hierarchical, explainable|Small datasets|
|**Graph-Based Clustering (Neo4j)**|Tracks name changes|Complex relationships|
**Python Code: DBSCAN Clustering**
``` python
from sklearn.cluster import DBSCAN
import numpy as np
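# Convert similarity matrix into a NumPy array
sim_array = similarity_matrix.cpu().numpy()
# DBSCAN expects distances, so convert similarity to cosine distance
# (clip tiny negative values caused by floating-point error)
dist_array = np.clip(1 - sim_array, 0, None)
# eps is a distance: 0.15 corresponds to a similarity of at least 0.85
clustering = DBSCAN(eps=0.15, min_samples=2, metric="precomputed").fit(dist_array)
# Print clusters (label -1 marks names DBSCAN treats as noise)
for cluster_id in sorted(set(clustering.labels_)):
    cluster_members = [names[i] for i in range(len(names)) if clustering.labels_[i] == cluster_id]
    print(f"Cluster {cluster_id}: {cluster_members}")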
```
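The agglomerative option from the table can be sketched with SciPy's hierarchical clustering. This is only an illustration under stated assumptions: the similarity values below are invented toy numbers rather than model output, and the 0.15 distance cut (similarity of roughly 0.85) is an assumed threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy names and similarity scores (assumed values, for illustration only)
toy_names = ["linda porter", "linda r porter", "becky porter", "charles turner"]
toy_sim = np.array([
    [1.00, 0.95, 0.60, 0.20],
    [0.95, 1.00, 0.58, 0.22],
    [0.60, 0.58, 1.00, 0.18],
    [0.20, 0.22, 0.18, 1.00],
])

# Convert similarity to distance and force an exact zero diagonal
distance = np.clip(1.0 - toy_sim, 0.0, None)
np.fill_diagonal(distance, 0.0)

# linkage() takes the condensed (upper-triangle) form of the distance matrix
condensed = squareform(distance, checks=False)
tree = linkage(condensed, method="average")

# Cut the tree wherever merging would exceed a distance of 0.15
labels = fcluster(tree, t=0.15, criterion="distance")

clusters = {}
for name, label in zip(toy_names, labels):
    clusters.setdefault(int(label), []).append(name)
print(clusters)
```

Unlike DBSCAN, this approach assigns every name to a cluster (possibly a cluster of one) and the merge tree can be inspected to explain why two names were grouped.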
### **5. Resolving Ambiguities with Graph Databases**
For legal records, name changes over time **must be tracked** (e.g., after marriage). **Graph databases (Neo4j or Oracle Graph)** allow us to store **alias relationships**.
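**Example: Relational Schema for Alias Tracking (Oracle SQL)**
``` sql
CREATE TABLE name_aliases (
    party_id          NUMBER        NOT NULL,
    alias_name        VARCHAR2(255) NOT NULL,
    relationship_type VARCHAR2(50), -- e.g., "Maiden Name", "Nickname"
    -- Composite key so that one party can have many aliases
    CONSTRAINT name_aliases_pk PRIMARY KEY (party_id, alias_name)
);
```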
**Querying Neo4j for Name Resolutions**
```
MATCH (p:Person)-[:ALIAS_OF]->(a:Person)
WHERE a.name = "Linda Rebecca Porter"
RETURN p.name
```
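Outside a graph database, the same alias resolution can be approximated in memory with a union-find structure. The sketch below is an assumption-laden stand-in, not Neo4j code: `AliasIndex` and its methods are hypothetical names, and alias links are treated as symmetric and transitive.

```python
class AliasIndex:
    """Groups names that have been linked as aliases of one another."""

    def __init__(self):
        self._parent = {}

    def _find(self, name):
        # Register unseen names as their own root
        self._parent.setdefault(name, name)
        root = name
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited name straight at the root
        while self._parent[name] != root:
            next_name = self._parent[name]
            self._parent[name] = root
            name = next_name
        return root

    def add_alias(self, alias, canonical):
        """Record that `alias` refers to the same person as `canonical`."""
        self._parent[self._find(alias)] = self._find(canonical)

    def same_person(self, a, b):
        return self._find(a) == self._find(b)

    def aliases_of(self, name):
        """All names in the same alias group, including `name` itself."""
        root = self._find(name)
        return sorted(n for n in list(self._parent) if self._find(n) == root)


index = AliasIndex()
index.add_alias("becky porter", "linda rebecca porter")
index.add_alias("linda turner porter johnson", "linda rebecca porter")
index.add_alias("rebecca porter", "becky porter")

print(index.same_person("rebecca porter", "linda turner porter johnson"))
print(index.aliases_of("becky porter"))
```

A graph database keeps the relationship type and direction on each edge, which this structure discards; it only answers whether two names belong to the same group.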
### **6. Storing & Querying Name Embeddings in a Vector Database**
For **fast lookups**, store name embeddings in **Oracle Vector Search** or **FAISS**.
**SQL: Storing Name Embeddings in Oracle**
``` sql
-- 768 dimensions matches the embedding size of the models listed above
ALTER TABLE parties ADD (name_embedding VECTOR(768, FLOAT32));
INSERT INTO parties (party_id, normalized_name, name_embedding)
VALUES (1001, 'Charles David Turner', TO_VECTOR('[...]'));
```
**SQL: Finding Similar Names**
``` sql
SELECT party_id, normalized_name
FROM parties
-- Cosine distance = 1 - cosine similarity, so < 0.15 means similarity > 0.85
WHERE VECTOR_DISTANCE(name_embedding, :query_embedding, COSINE) < 0.15;
```
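For a sense of what such a lookup computes, here is a brute-force NumPy sketch of a top-k cosine search. It is a stand-in under stated assumptions, not Oracle or FAISS code: `top_k_similar` is a hypothetical helper, and the 3-D vectors are toy values rather than real embeddings. A vector index does the same ranking, only faster and at larger scale.

```python
import numpy as np


def top_k_similar(query, stored, k=3, min_similarity=0.85):
    """Return (row_index, score) for the k most similar stored vectors above the cutoff."""
    stored_unit = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = stored_unit @ query_unit          # cosine similarity per stored row
    order = np.argsort(-scores)[:k]            # highest scores first
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_similarity]


# Toy stored embeddings (assumed values, for illustration only)
stored_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
query_embedding = np.array([1.0, 0.0, 0.0])

results = top_k_similar(query_embedding, stored_embeddings)
print(results)
```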
### **7. Handling Partial Matches & Confidence Scores**
For **uncertain cases**, apply **threshold-based confidence scoring**.
|**Confidence Score**|**Action**|
|---|---|
|**0.90 - 1.00**|Auto-link names|
|**0.75 to just under 0.90**|Queue for review|
|**Below 0.75**|Treat as a separate person|
**Python Code for Confidence-Based Matching**
``` python
def get_high_confidence_matches(similarity_matrix, names, threshold=0.85):
    """Returns name pairs whose similarity exceeds the threshold."""
    matches = []
    # Visit each unordered pair once (upper triangle of the matrix)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity_matrix[i, j] > threshold:
                matches.append((names[i], names[j], float(similarity_matrix[i, j])))
    return matches

# Find matches above 85% confidence
high_conf_matches = get_high_confidence_matches(similarity_matrix.cpu().numpy(), names, threshold=0.85)
print(high_conf_matches)
```
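The three bands in the table can also be expressed as a small routing function. This is a minimal sketch: `route_match` and its return labels are hypothetical names, the cutoffs are the ones from the table, and the sample scores are the ones quoted earlier in this post.

```python
def route_match(score, auto_link=0.90, review=0.75):
    """Map a similarity score to one of the three actions in the table."""
    if score >= auto_link:
        return "auto_link"
    if score >= review:
        return "review"
    return "separate"


# Pairs and scores quoted in the interpretation section above
sample_pairs = [
    ("Becky Porter, Attorney", "Becky Porter", 0.96),
    ("David Charles Turner", "Charles David Turner", 0.78),
    ("Linda P. Johnson", "Linda Rebecca Porter", 0.64),
]

decisions = [(a, b, route_match(score)) for a, b, score in sample_pairs]
for a, b, action in decisions:
    print(f"{a!r} vs {b!r}: {action}")
```

Keeping the cutoffs as parameters makes it easy to tighten or loosen them per county once reviewers have given feedback on the flagged pairs.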
### **Final Workflow Summary**
1️⃣ **Extract Names from Legal Documents**
2️⃣ **Normalize Names (Lowercase, Remove Punctuation, Expand Initials)**
3️⃣ **Compute AI-Based Similarity Scores (Embeddings + Cosine Similarity)**
4️⃣ **Cluster Names Using DBSCAN or Graph-Based Clustering**
5️⃣ **Track Name Changes Over Time Using Graph Databases**
6️⃣ **Store Name Embeddings in Oracle Vector Search or FAISS**
7️⃣ **Apply Confidence Scoring to Handle Partial Matches**
### Alternatives
TMTOWTDI: There's more than one way to do it:
- **Process Entire Population**:
Another way to do this is to use an AI vision model and an LLM to extract and process data from every instrument. The goal here is to improve the quality and detail of the index data. Each county may have 1,000,000s of instruments, totaling trillions of tokens. As most models have a cost, either as a service or for hosting, this strategy is cost-prohibitive for the moment.
- **Machine Learning Techniques**
With Python libraries (TBD).
- **Regular Expressions**
- **Elasticsearch**
Here's an example for Elasticsearch. In this case we query the county index as it is, with all its accuracy or inaccuracy, detail or lack of detail:
``` json
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "County": ["EREHWON (NM)"] } },
        {
          "bool": {
            "should": [
              { "match_phrase": { "PartyNames": "PORTER LINDA HALL" } },
              { "match_phrase": { "PartyNames": "PORTER LINDA REBECCA HALL" } },
              { "match_phrase": { "PartyNames": "PORTER LINDA R" } },
              { "match_phrase": { "PartyNames": "PORTER LINDA REBECCA" } },
              { "match_phrase": { "PartyNames": "PORTER LINDA ROBECCA" } },
              { "match_phrase": { "PartyNames": "PORTER LINDA REBECCA HALL ATTORNEY" } },
              { "match_phrase": { "PartyNames": "PORTER LINDA REBECCA HALL INDIV & ATTORNEY" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY HALL" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY HALL ATTORNEY" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY HALL INDIV & ATTORNEY" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY HALL TRUSTEE" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY S" } },
              { "match_phrase": { "PartyNames": "PORTER BECKY SHARP" } }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  },
  ...
}
```
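Writing that query by hand does not scale once a cluster of name variants exists, so the query body can be generated from the cluster. The sketch below is one possible approach: `build_party_query` is a hypothetical helper, the `County` and `PartyNames` field names are taken from the query above, and only the query portion is built (the elided parts of the request are left out).

```python
import json


def build_party_query(county, name_variants):
    """Build a bool query matching any of the given name variants within one county."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"County": [county]}},
                    {
                        "bool": {
                            "should": [
                                {"match_phrase": {"PartyNames": variant}}
                                for variant in name_variants
                            ],
                            "minimum_should_match": 1,
                        }
                    },
                ]
            }
        }
    }


# A few of the variants from the hand-written query above
variants = ["PORTER LINDA R", "PORTER LINDA REBECCA", "PORTER BECKY"]
query_body = build_party_query("EREHWON (NM)", variants)
print(json.dumps(query_body, indent=2))
```

The variants would come from a name cluster produced earlier in the pipeline, so a newly discovered spelling is picked up by the search without editing the query by hand.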