Phase 12: Deduplication Service

Release 3 -- Clean Data Foundation

Phase 12 is the first step in the screening pipeline. Before studies can be screened, duplicate citations must be identified and resolved.

Summary

When researchers import citations from multiple bibliographic databases, the same paper often appears multiple times. This phase adds duplicate detection using the ASySD algorithm -- a tool developed by CAMARADES and trusted in the systematic review community. High-confidence duplicates are handled automatically; uncertain matches are queued for administrator review. The system now distinguishes three levels: the Citation (the raw import, preserved exactly as imported), the Study (the canonical record, enriched with the best metadata), and the Publication (the global bibliographic reference, shared across all projects).

The Problem

Today, SyRF has no duplicate detection. When a team imports citations from PubMed, Embase, and Scopus, the same paper may appear three times in their project. Researchers must manually identify and remove duplicates -- a tedious, error-prone process that wastes screening effort and produces inaccurate PRISMA counts.

Duplicate detection is also the foundation for accurate PRISMA 2020 reporting. The PRISMA flow diagram requires separate counts of total records imported, duplicates removed, and unique studies proceeding to screening. Without automated deduplication, these counts must be computed manually.

What We Are Building

The Three-Level Data Model

This phase introduces a fundamental change to how SyRF manages bibliographic data, separating it into three distinct levels:

```mermaid
flowchart TD
    subgraph "Level 1: Publication"
        R[Global Bibliographic Identity]
        R -->|"DOI / PMID lookup"| R
    end

    subgraph "Level 2: Citation"
        IR1["Import from PubMed"]
        IR2["Import from Embase"]
        IR3["Import from Scopus"]
    end

    subgraph "Level 3: Study"
        S["Canonical Study<br/>(best metadata from all imports)"]
    end

    IR1 --> S
    IR2 --> S
    IR3 --> S
    S --> R

    style R fill:#e1f5fe
    style S fill:#e8f5e9
```

Level	What It Represents	Key Property
Publication	A unique piece of research across the entire SyRF system	System-wide, shared across all projects. If two projects import the same paper, they link to the same Publication
Citation	A single citation exactly as imported from a specific source	Immutable -- never modified after creation. Preserves the raw bibliographic data for PRISMA counting
Study	The reviewable entity that annotators and screeners work with	Enriched with the best metadata from all its Citations (Value Objects)
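As a rough illustration of the three levels, the sketch below models them as Python dataclasses. The class and field names are assumptions made for this example, not SyRF's actual schema; the one property taken directly from the table is that a Citation is immutable, which is expressed here with `frozen=True`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Publication:
    """Level 1: system-wide identity, shared across all projects."""
    publication_id: str
    doi: Optional[str] = None
    pmid: Optional[str] = None


@dataclass(frozen=True)
class Citation:
    """Level 2: one record exactly as imported. Frozen, so it cannot be modified."""
    citation_id: str
    source: str  # e.g. "PubMed", "Embase", "Scopus"
    title: str
    doi: Optional[str] = None
    pmid: Optional[str] = None


@dataclass
class Study:
    """Level 3: the reviewable entity, enriched from all of its Citations."""
    study_id: str
    publication: Publication
    citations: List[Citation] = field(default_factory=list)
    title: str = ""


# Hypothetical data: the same paper imported from two sources.
pub = Publication("pub-1", doi="10.1000/example")
pubmed = Citation("cit-1", "PubMed", "An example title", doi="10.1000/example")
embase = Citation("cit-2", "Embase", "An example title.", doi="10.1000/example")
study = Study("study-1", pub, [pubmed, embase], title=pubmed.title)
```

Attempting to assign to a field of `pubmed` raises an error, which mirrors the guarantee that raw import data is never altered after creation.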

How Deduplication Works

When new citations are imported:

Exact matching: The system checks whether the new citation's DOI or PubMed ID already exists in the system. If so, the citation is linked to the existing Publication.
Fuzzy matching: For citations without exact identifiers, the ASySD algorithm compares bibliographic fields (including title, authors, year, journal, abstract, pages, volume, and issue) using Jaro-Winkler string comparison across four rounds of progressively broader comparisons.
Automatic resolution: High-confidence duplicates are confirmed automatically -- the duplicate study is removed from screening pools and its import data is preserved on the canonical study.
Admin review queue: Uncertain matches are queued for administrator review. Each pair is presented side-by-side with match details, similarity scores, and review data summaries. The administrator confirms, rejects, or defers each pair.
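The four steps above can be sketched as a single routing function. This is a simplified illustration, not the service's implementation: the thresholds, the `Incoming` class, the `route` function, and the idea of collapsing the fuzzy match into one score are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical thresholds; the real cut-offs are defined by ASySD and the spec.
AUTO_CONFIRM = 0.95
REVIEW_FLOOR = 0.80


@dataclass
class Incoming:
    title: str
    doi: Optional[str] = None
    pmid: Optional[str] = None


def identifier_keys(citation: Incoming) -> List[str]:
    """Build normalised lookup keys for the exact-match step."""
    keys = []
    if citation.doi:
        keys.append("doi:" + citation.doi.strip().lower())
    if citation.pmid:
        keys.append("pmid:" + citation.pmid.strip())
    return keys


def route(citation: Incoming, known_ids: Dict[str, str], fuzzy_score: float) -> str:
    """Decide what happens to one incoming citation.

    known_ids maps an identifier key to an existing Publication id.
    fuzzy_score stands in for the best similarity found by fuzzy matching.
    """
    # Step 1: exact matching on DOI or PubMed ID.
    for key in identifier_keys(citation):
        if key in known_ids:
            return "link-to-publication:" + known_ids[key]
    # Steps 2 to 4: fuzzy matching, then automatic resolution or admin review.
    if fuzzy_score >= AUTO_CONFIRM:
        return "auto-confirm-duplicate"
    if fuzzy_score >= REVIEW_FLOOR:
        return "queue-for-admin-review"
    return "unique"


known = {"doi:10.1000/example": "pub-1"}
print(route(Incoming("A", doi="10.1000/EXAMPLE"), known, 0.0))
print(route(Incoming("B"), known, 0.85))
```

Prefixing keys with `doi:` and `pmid:` keeps the two identifier spaces separate in a single lookup table.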

The ASySD Algorithm

ASySD (Automated Systematic Search Deduplicator) was developed by CAMARADES and is published in BMC Biology. It uses a four-round blocking strategy with Jaro-Winkler string comparison across ten bibliographic fields. Published benchmarks report sensitivity of 0.95-0.998 and specificity above 0.999, processing up to 80,000 citations in under an hour.

The algorithm runs as an R subprocess, using the original ASySD R package directly. This minimizes the risk of divergence from the upstream implementation and lets SyRF benefit from ongoing improvements to the algorithm.
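SyRF calls the R package rather than reimplementing it, so the following is only an illustration of what Jaro-Winkler similarity measures. It is a plain textbook implementation in Python, not ASySD's code; the function names and the default prefix weight of 0.1 are choices made for this sketch.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: shared characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count mismatches.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

The prefix boost is the reason this metric suits bibliographic fields such as titles and author lists, where variants of the same record usually agree at the start and differ in trailing punctuation or abbreviations.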

Canonical Enrichment

When multiple Citations are confirmed as duplicates, the canonical study is automatically enriched with the best metadata from all sources. For example:

The longest title is preferred (may include subtitle)
The most complete author list is preferred
A non-empty abstract is preferred over an empty one
A PubMed-sourced DOI is preferred (most reliably formatted)

Field-level provenance tracks which Citation provided each canonical value, maintaining full auditability.
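A minimal sketch of these preference rules with field-level provenance follows. The dictionary layout, the `enrich` function, and the tie-breaking behaviour (the first candidate wins) are assumptions for illustration; "most complete author list" is interpreted here as the list with the most entries.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]  # hypothetical flat representation of one Citation


def enrich(citations: List[Record]) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Return (canonical values, provenance) chosen from duplicate Citations."""
    canonical: Dict[str, object] = {}
    provenance: Dict[str, str] = {}

    def choose(name: str, key: Callable[[Record], object]) -> None:
        # Only Citations with a non-empty value for this field are candidates.
        candidates = [c for c in citations if c.get(name)]
        if candidates:
            best = max(candidates, key=key)  # ties resolve to the first candidate
            canonical[name] = best[name]
            provenance[name] = str(best["citation_id"])

    choose("title", key=lambda c: len(c["title"]))            # longest title
    choose("authors", key=lambda c: len(c["authors"]))        # most authors
    choose("abstract", key=lambda c: 0)                       # any non-empty abstract
    choose("doi", key=lambda c: c.get("source") == "PubMed")  # PubMed DOI first
    return canonical, provenance


duplicates = [
    {"citation_id": "cit-1", "source": "Embase", "title": "Short title",
     "authors": ["A"], "abstract": "", "doi": "10.1000/X"},
    {"citation_id": "cit-2", "source": "PubMed", "title": "Short title: with a subtitle",
     "authors": ["A", "B"], "abstract": "Text.", "doi": "10.1000/x"},
]
canonical, provenance = enrich(duplicates)
print(provenance)
```

Because the source Citations are only read, never written, the enrichment step stays compatible with the rule that raw imports are immutable.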

The Merge Wizard

When duplicate citations have already been reviewed (annotated or screened), the system never auto-merges them. Instead, an administrator is presented with a merge wizard:

Same-stage duplicates: The secondary study's annotation sessions become additional candidate sessions for reconciliation on the canonical study
Cross-stage duplicates: Both studies are linked via their shared Publication but review data remains separate (the review contexts are different)

This protects existing work while still resolving the duplication.
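The two merge outcomes can be sketched as follows. This models only the step after an administrator has confirmed the pair in the wizard; the `ReviewedStudy` class, its fields, and the `apply_confirmed_merge` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReviewedStudy:
    study_id: str
    publication_id: str
    stage: str  # hypothetical label for the stage the study was reviewed in
    sessions: List[str] = field(default_factory=list)
    candidate_sessions: List[str] = field(default_factory=list)


def apply_confirmed_merge(canonical: ReviewedStudy, secondary: ReviewedStudy) -> str:
    """Apply a merge that an administrator has already confirmed."""
    # In both cases the two studies end up linked to the same Publication.
    secondary.publication_id = canonical.publication_id
    if canonical.stage == secondary.stage:
        # Same stage: the secondary's sessions become extra candidates
        # for reconciliation on the canonical study.
        canonical.candidate_sessions.extend(secondary.sessions)
        return "same-stage"
    # Cross stage: review data stays where it is.
    return "cross-stage"


first = ReviewedStudy("study-1", "pub-1", "screening", ["session-a"])
second = ReviewedStudy("study-2", "pub-1", "screening", ["session-b"])
print(apply_confirmed_merge(first, second), first.candidate_sessions)
```

Note that the secondary study's own session list is left untouched in both branches, so no review work is lost.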

Safety Guarantees

Citations are never deleted or modified -- the raw import data is always preserved
Reviewed studies are never auto-merged -- administrators always decide when review data exists
All decisions are auditable -- every dedup decision (automatic or manual) is logged with confidence level, matching details, and decision source
Duplicate studies are excluded from pools -- they remain in the database for audit but never appear in screening or annotation queues

Why This Matters for PRISMA

The PRISMA 2020 flow diagram requires accurate counts of records imported, duplicates removed, and unique studies entering screening. The three-level data model makes these counts automatic:

Total records imported: Count of all Citations, grouped by source type
Duplicates removed: Count of studies with Duplicate or Merged lifecycle status
Unique studies screened: Count of Active studies after deduplication

These counts feed directly into the PRISMA diagram generated in Phase 16.
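The three counts above reduce to simple aggregations over Citations and Studies. The sketch below assumes each record is a dictionary with a `source` or `status` key; those key names and the `prisma_counts` function are illustrative, while the status values Active, Duplicate, and Merged come from the list above.

```python
from collections import Counter
from typing import Dict, List


def prisma_counts(citations: List[Dict[str, str]],
                  studies: List[Dict[str, str]]) -> Dict[str, object]:
    """Derive the PRISMA inputs from Citations and Study lifecycle statuses."""
    by_source = Counter(c["source"] for c in citations)
    by_status = Counter(s["status"] for s in studies)
    return {
        "records_imported_by_source": dict(by_source),
        "records_imported_total": sum(by_source.values()),
        "duplicates_removed": by_status["Duplicate"] + by_status["Merged"],
        "unique_studies": by_status["Active"],
    }


# Hypothetical project: one paper imported three times, plus one other paper.
citations = [{"source": "PubMed"}, {"source": "Embase"},
             {"source": "Scopus"}, {"source": "PubMed"}]
studies = [{"status": "Active"}, {"status": "Duplicate"},
           {"status": "Merged"}, {"status": "Active"}]
print(prisma_counts(citations, studies))
```

Because Citations are never deleted, the imported total stays stable no matter how many duplicates are later resolved.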

How It Connects

Connection	Detail
Phase 2 (PRISMA Specification)	Implements the three-level data model and lifecycle status model specified in Phase 2
Phase 13 (Screening Profiles)	Deduplication runs before screening -- only unique, active studies enter screening pools
Phase 16 (Export and PRISMA)	Dedup counts feed PRISMA box 3, records removed before screening (duplicate records removed, records marked as ineligible by automation tools, records removed for other reasons)
Phase 9 (Reconciliation Model)	Same-stage duplicate merge creates additional candidate sessions for reconciliation

For the platform architecture overview, see platform-architecture.md. For the full deduplication technical specification, see deduplication-service-specification.md.

Phase 12 (dedup) cleans the data. Phase 13 (profiles) configures screening criteria. Phase 14 (filtering) routes studies to stages. Phase 15 (screening) adds structured decisions. Phase 16 (export/PRISMA) delivers the output.