Deduplication Service Specification

Purpose

This document specifies the deduplication service interface, ASySD algorithm integration approach, confidence model, canonical enrichment rules, and bibliographic consolidation rules for already-reviewed duplicate pairs. It is a binding constraint on Phase 12 implementation.

The dedup service is responsible for detecting and resolving duplicate citations within a project and across projects via the Publication entity. It produces the data required for PRISMA box 3 (duplicates removed, excluded by automation, excluded other) and enforces the system invariant that duplicate/merged studies never appear in stage study pools.

Normative language: "MUST" indicates an absolute requirement. "SHALL" indicates mandatory behavior. "SHOULD" indicates a strong recommendation. "MAY" indicates optional behavior.

Algorithm reference: ASySD -- Hair et al. (2023), BMC Biology 21, 189; ASySD R package (github.com/camaradesuk/ASySD). SyRF implements this algorithm natively in C# (see Section 3). CAMARADES (the ASySD maintainer) and SyRF share the same organisation, so algorithm improvements flow in both directions.

1. Service Overview

The deduplication service detects and resolves duplicate citations within a project and across projects via the Publication entity (see three-level-data-model.md Section 2).

Entry Points

The service has two entry points:

  1. Import-time deduplication: Triggered automatically when new citations are imported via a systematic search. Runs as a two-stage pipeline:
     • Stage 1 (synchronous): DOI/PMID exact match against pmPublication. Completes as part of the import operation. Matched records are resolved immediately (lifecycleStatus = Active). Unmatched records are created with lifecycleStatus = PendingDedupCheck; they are visible to the admin but excluded from stage pools. The import saga completes here; citations are visible to the user immediately.
     • Stage 2 (asynchronous): Full fuzzy matching (blocking, Jaro-Winkler scoring, classification, grouping) runs as a background job after Stage 1 completes. Triggered once per import event, not per batch. Resolves PendingDedupCheck studies to Active, Duplicate, or PendingDuplicateReview.
  2. Retroactive deduplication: Triggered by a project administrator to review existing studies for duplicates that were missed at import time (e.g., if a project predates the dedup service deployment). Runs the full fuzzy matching pipeline across all active studies in the project.

Invariants

  1. The service SHALL NEVER auto-merge studies that have review data (screening decisions or annotation sessions). Admin MUST always decide for reviewed studies.
  2. All deduplication decisions (automatic or admin-reviewed) SHALL be auditable -- logged with confidence level, decision source, timestamp, and match details.
  3. Citations SHALL NEVER be deleted or modified by the dedup service. The original import data MUST be preserved to enable per-source PRISMA counting.
  4. Studies with lifecycleStatus IN (Duplicate, Merged, PendingDedupCheck) MUST NOT appear in any stage study pool. This is enforced at the query/filter level, not by deletion.
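
Invariant 4 can be read as a filter applied whenever a stage pool is queried. The C# sketch below illustrates that reading under stated assumptions: the Study record and StagePoolFilter class are hypothetical names, and the enum lists only the statuses mentioned in this specification. PendingDuplicateReview stays in the pool here only because invariant 4 does not list it.

```csharp
using System.Collections.Generic;
using System.Linq;

// Lifecycle states named in this specification.
public enum LifecycleStatus
{
    Active,
    PendingDedupCheck,
    PendingDuplicateReview,
    Duplicate,
    Merged
}

// Hypothetical minimal Study shape; the real entity carries more fields.
public sealed record Study(string Id, LifecycleStatus LifecycleStatus);

public static class StagePoolFilter
{
    // Statuses that invariant 4 excludes from every stage study pool.
    private static readonly HashSet<LifecycleStatus> Excluded = new()
    {
        LifecycleStatus.Duplicate,
        LifecycleStatus.Merged,
        LifecycleStatus.PendingDedupCheck
    };

    // Filters rather than deletes: excluded studies remain in storage.
    public static IEnumerable<Study> EligibleForStagePool(IEnumerable<Study> studies) =>
        studies.Where(s => !Excluded.Contains(s.LifecycleStatus));
}
```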

2. ASySD Algorithm Specification

2.1 Algorithm Summary

ASySD (Automated Systematic Search Deduplication) uses a 4-round blocking strategy with Jaro-Winkler string comparison across 10 bibliographic fields. It produces a two-tier output: automatically confirmed duplicates (high confidence) and probable duplicates (requiring manual review).

Performance characteristics:

| Dataset Size | Processing Time | Sensitivity | Specificity |
|---|---|---|---|
| 1,845 citations | < 1 min | 0.95--0.998 | >0.999 |
| 79,880 citations | < 1 hour | 0.95--0.998 | >0.999 |

2.2 Blocking Rounds

Each round uses different field combinations as blocking criteria. Fields in a blocking criterion MUST match 100% (exact match after normalization) to form a candidate pair for string comparison.

| Round | Blocking Criteria (OR-combined within round) |
|---|---|
| 1 | (Title AND Pages) OR (Title AND Author) OR (Title AND Abstract) OR DOI |
| 2 | (Author AND Year AND Pages) OR (Journal AND Volume AND Pages) OR (ISBN AND Volume AND Pages) OR (Title AND ISBN) |
| 3 | (Year AND Pages AND Volume) OR (Year AND Issue AND Volume) OR (Year AND Pages AND Issue) |
| 4 | (Author AND Year) OR (Title AND Year) OR (Title AND Volume) OR (Title AND Journal) |

Preprocessing: Before blocking, all fields are normalized:

  • Missing/anonymous authors renamed to "Unknown"
  • DOI format harmonized (lowercase, prefix removed)
  • Punctuation removed
  • All text converted to uppercase

Candidate pair formation: A pair of records is formed if they match on ANY blocking criterion in ANY round. Each round is progressively less restrictive.
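
Candidate pair formation for a single round can be sketched as grouping records by each blocking key and pairing the members of every group. The C# below is a minimal illustration for round 1 only, not the ASySD code itself. NormalizedCitation, BlockingSketch, and the key separator are hypothetical, and skipping records with an empty key field is an assumption, since this section does not say how empty fields block.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, already-normalized record; only the fields round 1 needs.
public sealed record NormalizedCitation(
    int Id, string Title, string Author, string Pages, string Abstract, string Doi);

public static class BlockingSketch
{
    // Round 1 criteria from Section 2.2, one key builder per OR-branch.
    private static readonly Func<NormalizedCitation, string?>[] Round1Keys =
    {
        c => Key(c.Title, c.Pages),
        c => Key(c.Title, c.Author),
        c => Key(c.Title, c.Abstract),
        c => Key(c.Doi)
    };

    // Returns null when any part is empty, so the record is skipped
    // for that branch (assumption, see lead-in).
    private static string? Key(params string[] parts) =>
        parts.Any(string.IsNullOrEmpty) ? null : string.Join("\u001F", parts);

    // Returns each candidate pair once, as (lowerId, higherId).
    public static HashSet<(int, int)> CandidatePairs(IReadOnlyList<NormalizedCitation> records)
    {
        var pairs = new HashSet<(int, int)>();
        foreach (var keyOf in Round1Keys)
        {
            var blocks = records
                .Select(r => (Record: r, Key: keyOf(r)))
                .Where(x => x.Key is not null)
                .GroupBy(x => x.Key, x => x.Record.Id);

            foreach (var block in blocks)
            {
                var ids = block.OrderBy(id => id).ToList();
                for (var i = 0; i < ids.Count; i++)
                    for (var j = i + 1; j < ids.Count; j++)
                        pairs.Add((ids[i], ids[j]));
            }
        }
        return pairs;
    }
}
```

Because the pairs are stored in a set keyed on the ordered id tuple, a pair that matches several criteria is still emitted once.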

String comparison: For each candidate pair, Jaro-Winkler similarity is computed across all 10 fields (Title, Author, Year, Journal, ISBN, Abstract, DOI, Issue, Pages, Volume) by the C# JaroWinklerScorer (see Section 3.3).

Graph-based grouping: Duplicate groups are assigned via transitive closure -- if A=B and B=C, then A, B, and C form one group (generate_dup_id() function).
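
The transitive closure described above is commonly computed with a Union-Find (disjoint set) structure. The following C# sketch shows the idea with integer record ids; the UnionFind class and its method names are illustrative and are not taken from the ASySD package.

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class UnionFind
{
    private readonly Dictionary<int, int> _parent = new();

    // Finds the representative of x, registering x on first sight.
    public int Find(int x)
    {
        if (!_parent.TryGetValue(x, out var p))
        {
            _parent[x] = x;
            return x;
        }
        if (p == x) return x;
        var root = Find(p);
        _parent[x] = root; // path compression
        return root;
    }

    // Merges the groups containing a and b.
    public void Union(int a, int b)
    {
        var (ra, rb) = (Find(a), Find(b));
        if (ra != rb) _parent[rb] = ra;
    }

    // Groups every seen id by its representative. Keys are snapshotted
    // first because Find may rewrite entries while compressing paths.
    public List<List<int>> Groups() =>
        _parent.Keys.ToList()
            .GroupBy(id => Find(id))
            .Select(g => g.OrderBy(id => id).ToList())
            .ToList();
}
```

Calling Union once per confirmed pair and then Groups() yields one list per duplicate group, so A=B and B=C place A, B, and C together.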

2.3 Required Input Fields

ASySD processing takes the following input fields; the Required column marks which are mandatory. Each maps to a specific SyRF entity field.

| # | ASySD Field | Required | Used In | SyRF Source Entity | SyRF Field |
|---|---|---|---|---|---|
| 1 | record_id | Auto-generated if absent | Unique identifier | Citation | _id |
| 2 | author | Yes | Blocking rounds 1, 2, 4; string comparison | Citation | rawAuthors |
| 3 | year | Yes | Blocking rounds 2, 3, 4; string comparison | Citation | rawYear |
| 4 | journal | Yes | Blocking rounds 2, 4; string comparison | Citation | rawJournal |
| 5 | doi | Yes | Blocking round 1; string comparison | Citation | rawDoi |
| 6 | title | Yes | Blocking rounds 1, 2, 4; string comparison | Citation | rawTitle |
| 7 | pages | Yes | Blocking rounds 1, 2, 3; string comparison | Citation | rawPages |
| 8 | volume | Yes | Blocking rounds 2, 3, 4; string comparison | Citation | rawVolume |
| 9 | number (issue) | Yes | Blocking round 3; string comparison | Citation | rawIssue |
| 10 | abstract | No (improves results) | Blocking round 1; string comparison | Citation | rawAbstract |
| 11 | isbn | No (improves results) | Blocking round 2; string comparison | Citation | rawIsbn |
| 12 | label | No (optional) | User tagging; retention priority | SystematicSearch | sourceType.ToString() |
| 13 | source | No (optional) | Database origin tracking | SystematicSearch | sourceName |

2.4 Field Mapping (ASySD to SyRF)

This table defines the exact mapping between ASySD input fields and SyRF entity fields. Phase 12 implementation MUST use these mappings when constructing the input for the C# deduplication algorithm.

| ASySD Field | SyRF Entity.Field | Notes |
|---|---|---|
| record_id | Citation._id | Guid serialized as string |
| title | Citation.rawTitle | Raw title as imported |
| author | Citation.rawAuthors | Raw author string as imported |
| year | Citation.rawYear | String format preserved |
| journal | Citation.rawJournal | Raw journal name as imported |
| doi | Citation.rawDoi | Raw DOI as imported |
| pages | Citation.rawPages | Raw page range as imported |
| volume | Citation.rawVolume | Raw volume as imported |
| number | Citation.rawIssue | ASySD uses "number" for what SyRF calls "issue" |
| abstract | Citation.rawAbstract | Raw abstract as imported |
| isbn | Citation.rawIsbn | Raw ISBN/ISSN as imported |
| source | SystematicSearch.sourceName | Name of the search source (e.g., "PubMed") |
| label | SystematicSearch.sourceType.ToString() | Source type enum as string (e.g., "Database") |

3. Integration Architecture

3.1 Native C# Implementation via MassTransit Consumer

The import pipeline already uses MassTransit saga orchestration for study parsing. The dedup service is integrated as a two-stage step after citation parsing and Citation creation:

  • DedupStage1Consumer: Handles DOI/PMID exact matching synchronously within the import saga. Fast; scales linearly with batch size via MongoDB index lookups.
  • DedupStage2Consumer: Handles fuzzy matching asynchronously after all Stage 1 batches complete. Decoupled from the import saga; runs once per import event.

Both consumers implement IDeduplicationService (Section 5.3) and are independently unit-testable.

Why native C# (not R subprocess): The ASySD algorithm is implemented directly in C# for the following reasons:

  • No runtime dependency: No R installation required in the Docker image.
  • Full observability: .NET distributed tracing, structured logging, and exception handling flow through the dedup logic without crossing a process boundary.
  • Parallelism: PLINQ handles the embarrassingly parallel scoring step (each candidate pair is independent) natively with .NET thread pools.
  • Algorithm ownership: CAMARADES (the ASySD maintainer) and SyRF share the same organisation. There is no divergence risk, and improvements discovered during SyRF implementation can flow back to the R package.
  • Straightforward port: The algorithm is standard Jaro-Winkler + four blocking rounds + 25 boolean classification rules + Union-Find grouping — each component is well-understood and unit-testable.

The IDeduplicationService abstraction (Section 5.3) isolates the implementation. Future algorithmic improvements require only changing the implementation, not any consumers or saga logic.

3.2 Sequence Diagram

Stage 1 — Synchronous (import saga waits; completes before returning to user)

ImportSaga                 DedupStage1Consumer
    |                             |
    |  Stage1DedupRequest         |
    |  (BatchId, ProjectId,       |
    |   RecordCount)              |
    |--------------------------->|
    |                             |  Load batch from pmDedupBatch
    |                             |
    |                             |  DOI/PMID exact-match against pmPublication
    |                             |  → Create/update Publications (matched records)
    |                             |  → Create Studies: lifecycleStatus = Active (matched)
    |                             |  → Create Studies: lifecycleStatus = PendingDedupCheck (unmatched)
    |                             |  → Publish Stage2DedupRequest (if any unmatched)
    |                             |  → Delete pmDedupBatch document
    |                             |
    |  Stage1Complete             |
    |<---------------------------|
    (import returns; citations visible to user immediately)

Stage 2 — Asynchronous (background; user does not wait)

                          DedupStage2Consumer
                                 |
    Stage2DedupRequest           |
    (ProjectId,                  |
     SystematicSearchId)         |
    ---------------------------->|
                                 |  Fetch PendingDedupCheck studies (new batch)
                                 |  + Active studies without publicationId
                                 |  (via StudyRepository)
                                 |
                                 |  CitationNormalizer
                                 |  (uppercase, strip punctuation, DOI normalisation)
                                 |
                                 |  BlockingEngine
                                 |  (4 rounds → candidate pairs)
                                 |
                                 |  JaroWinklerScorer
                                 |  (score 10 fields per pair, PLINQ parallel)
                                 |
                                 |  MatchClassifier
                                 |  (25 boolean rules → TrueMatch / ProbableDuplicate)
                                 |
                                 |  DuplicateGrouper
                                 |  (Union-Find transitive closure)
                                 |
                                 |  Create/update Publications (PublicationRepository)
                                 |
                                 |  Link Citations, set lifecycleStatus
                                 |  Active | Duplicate | PendingDuplicateReview
                                 |  (StudyRepository)
                                 |
                                 |  Record all decisions (DedupAuditLog)

3.3 C# Algorithm Components

The DedupStage2Consumer orchestrates the following components, each independently unit-testable:

CitationNormalizer (equivalent to ASySD's format_citations()):

  • Converts all fields to uppercase
  • Removes punctuation from title, abstract, year
  • Normalises DOI: strips https://doi.org/ prefix, lowercases
  • Normalises page ranges: ---
  • Treats null/empty author as "UNKNOWN"
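
A minimal C# sketch of three of these normalization rules follows. It is an illustration under assumptions rather than the ported format_citations(): the class and method names are hypothetical, the DOI rule (lowercase) is taken to override the general uppercase rule for that one field, and page-range normalization is left out.

```csharp
#nullable enable
using System;
using System.Linq;

public static class CitationNormalizerSketch
{
    private const string DoiPrefix = "https://doi.org/";

    // Strips punctuation and uppercases (title, abstract, year).
    public static string NormalizeText(string? value) =>
        new string((value ?? string.Empty)
            .Where(ch => !char.IsPunctuation(ch))
            .ToArray())
        .Trim()
        .ToUpperInvariant();

    // Null/empty authors become "UNKNOWN"; otherwise trim and uppercase.
    public static string NormalizeAuthor(string? value) =>
        string.IsNullOrWhiteSpace(value) ? "UNKNOWN" : value.Trim().ToUpperInvariant();

    // Strips the resolver prefix and lowercases.
    // Assumption: this rule wins over the uppercase rule for DOIs.
    public static string NormalizeDoi(string? value)
    {
        var doi = (value ?? string.Empty).Trim();
        if (doi.StartsWith(DoiPrefix, StringComparison.OrdinalIgnoreCase))
            doi = doi.Substring(DoiPrefix.Length);
        return doi.ToLowerInvariant();
    }
}
```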

BlockingEngine (equivalent to ASySD's compare.dedup(), 4 rounds):

  • Groups records by blocking key combinations (see Section 2.2)
  • Produces candidate pairs (record_id pairs that share at least one blocking key)
  • Deduplicates pairs across rounds

JaroWinklerScorer (equivalent to RecordLinkage::jarowinkler()):

  • Standard Jaro-Winkler: matching window floor(max_len / 2) - 1, Winkler prefix bonus capped at 4 characters with scaling factor 0.1
  • Scores 10 fields per candidate pair: author, title, abstract, year, pages, number, volume, journal, isbn, doi
  • Uses Parallel.ForEach (or PLINQ) over candidate pairs for throughput
  • Missing-field handling: both-null pairs score 0 for doi/abstract/year/journal/isbn; 1 for pages/volume/number (matching ASySD's NA handling)
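
The standard Jaro-Winkler computation with the parameters listed above (window floor(max_len / 2) - 1, prefix cap 4, scaling factor 0.1) can be sketched in C# as follows. This is a textbook formulation for illustration, not the RecordLinkage source. The class name is hypothetical, and returning 0 for an empty input is an assumption, because the per-field missing-value policy is applied outside this function.

```csharp
using System;

public static class JaroWinklerSketch
{
    public static double Similarity(string a, string b)
    {
        // Assumption: empty input scores 0; missing-field policy lives in the caller.
        if (a.Length == 0 || b.Length == 0) return 0.0;

        int window = Math.Max(0, Math.Max(a.Length, b.Length) / 2 - 1);
        var aMatched = new bool[a.Length];
        var bMatched = new bool[b.Length];
        int matches = 0;

        // Pass 1: find characters that match within the sliding window.
        for (int i = 0; i < a.Length; i++)
        {
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(b.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
            {
                if (bMatched[j] || a[i] != b[j]) continue;
                aMatched[i] = bMatched[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Pass 2: count matched characters that appear in a different order.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (!aMatched[i]) continue;
            while (!bMatched[k]) k++;
            if (a[i] != b[k]) outOfOrder++;
            k++;
        }

        double m = matches;
        double transpositions = outOfOrder / 2.0;
        double jaro = (m / a.Length + m / b.Length + (m - transpositions) / m) / 3.0;

        // Winkler bonus: shared prefix, capped at 4 characters, scaling factor 0.1.
        int prefix = 0;
        int maxPrefix = Math.Min(4, Math.Min(a.Length, b.Length));
        while (prefix < maxPrefix && a[prefix] == b[prefix]) prefix++;

        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

For the classic pair "MARTHA" and "MARHTA", this gives six matches, one transposition, a Jaro score of about 0.9444, and a Jaro-Winkler score of about 0.9611.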

MatchClassifier (equivalent to ASySD's identify_true_matches()):

  • Applies 25 OR-combined boolean threshold rules to produce TrueMatch or ProbableDuplicate
  • Applies DOI mismatch filter (low-DOI pairs with mismatching DOIs are demoted to ProbableDuplicate)
  • Applies year mismatch filter (year difference > 1 demotes to ProbableDuplicate)
  • See Section 2.2 for the full rule set
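
The year mismatch filter is the simplest of these steps to illustrate. The C# sketch below demotes a TrueMatch when the two years differ by more than 1. The MatchClass enum and YearMismatchFilter class are hypothetical names, and leaving the classification unchanged when a year does not parse as an integer is an assumption.

```csharp
using System;

public enum MatchClass { TrueMatch, ProbableDuplicate }

public static class YearMismatchFilter
{
    // Demotes TrueMatch to ProbableDuplicate when |yearA - yearB| > 1.
    public static MatchClass Apply(MatchClass current, string yearA, string yearB)
    {
        if (current != MatchClass.TrueMatch) return current;

        // Assumption: unparseable years leave the classification unchanged.
        if (int.TryParse(yearA, out var a) &&
            int.TryParse(yearB, out var b) &&
            Math.Abs(a - b) > 1)
        {
            return MatchClass.ProbableDuplicate;
        }
        return current;
    }
}
```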

DuplicateGrouper (equivalent to ASySD's generate_dup_id()):

  • Union-Find (disjoint set) over confirmed pairs to compute transitive closure
  • Selects canonical record per group using retention priority: keep_source preference > keep_label preference > lowest record index
  • Returns DuplicateGroup list (see Section 5.2)

3.4 Parallelism and Scaling

The scoring step (Jaro-Winkler across 10 fields per candidate pair) is embarrassingly parallel — each pair is independent. PLINQ handles this within the JaroWinklerScorer with no additional infrastructure.
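
As a rough illustration of the parallel scoring step, the C# sketch below runs a scoring delegate over candidate pairs with PLINQ. ParallelScoringSketch and the scorePair delegate are hypothetical stand-ins for the 10-field comparison. The delegate must be thread-safe, which holds when it only reads immutable record data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelScoringSketch
{
    // Scores every candidate pair independently. Pairs are assumed unique,
    // since they become dictionary keys.
    public static Dictionary<(int A, int B), double> ScoreAll(
        IEnumerable<(int A, int B)> candidatePairs,
        Func<int, int, double> scorePair)
    {
        return candidatePairs
            .AsParallel()
            .Select(p => (Pair: p, Score: scorePair(p.A, p.B)))
            .ToDictionary(x => x.Pair, x => x.Score);
    }
}
```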

| Concern | Approach |
|---|---|
| Pair scoring throughput | Parallel.ForEach / PLINQ on candidate pairs |
| Large batches | 1,000-record batch limit for Stage 1; Stage 2 processes all PendingDedupCheck studies at once (see Section 3.5) |
| Concurrent batches | DedupStage1Consumer is stateless per batch; Stage 2 runs once after all Stage 1 batches complete, eliminating inter-batch race conditions |
| Memory | Blocking keys are computed in-memory; 1,000 records × 10 fields is well within heap limits |

The service interface contract (Section 5) is abstracted from the implementation. Future optimisations (e.g., pre-computed blocking indexes, approximate nearest-neighbour indexes for the fuzzy step) require only changing the implementation behind IDeduplicationService.

3.5 Large Batch Handling

Systematic searches can import tens of thousands of citations. The two-stage pipeline (Section 3.1) addresses the core scalability concern: Stage 1 (DOI/PMID matching) is fast and synchronous; Stage 2 (fuzzy matching) is asynchronous and decoupled from the import saga. This section specifies how large imports are handled within that architecture.

Batch size limit. Each Stage1DedupRequest SHALL cover at most 1,000 records (MAX_DEDUP_BATCH_SIZE). The import saga MUST split larger imports into multiple batches, each producing a separate message. Multiple Stage 1 batches from the same systematic search MAY run concurrently — each is stateless with respect to the others.
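
The batch-splitting rule can be sketched in C# as follows. BatchSplitter and BatchPlan are hypothetical names; BatchPlan mirrors only the batch-related fields of Stage1DedupRequest (Section 5.1). The sketch relies on Enumerable.Chunk, which requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchSplitter
{
    public const int MaxDedupBatchSize = 1000; // MAX_DEDUP_BATCH_SIZE

    // Hypothetical plan for one Stage 1 batch.
    public sealed record BatchPlan(
        Guid BatchId, int BatchIndex, int TotalBatchCount, IReadOnlyList<Guid> CitationIds);

    // Splits an import into batches of at most MaxDedupBatchSize records.
    public static List<BatchPlan> Split(IReadOnlyList<Guid> citationIds)
    {
        var chunks = citationIds.Chunk(MaxDedupBatchSize).ToList();
        return chunks
            .Select((chunk, index) => new BatchPlan(Guid.NewGuid(), index, chunks.Count, chunk))
            .ToList();
    }
}
```

An import of 2,500 citations would yield three plans with BatchIndex 0, 1, and 2 and sizes 1,000, 1,000, and 500.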

Staging collection. Rather than embedding records in the message, the import saga SHALL write each batch to a pmDedupBatch staging document in MongoDB before sending the Stage1DedupRequest. The message carries only a BatchId reference. The DedupStage1Consumer loads the batch, processes it, and deletes the staging document on completion. A TTL index on pmDedupBatch.expiresAt (7-day expiry) ensures orphaned documents are cleaned up if the consumer crashes.

Stage 2 triggering. After all Stage 1 batches complete for a systematic search, the import saga publishes a single Stage2DedupRequest. Stage 2 fetches all PendingDedupCheck studies from the project and runs the full fuzzy matching pipeline once over them. Because Stage 2 sees all new records together, there is no inter-batch race condition — two records that would be duplicates of each other are in the same Stage 2 input set.

Stage 2 incremental dedup. The DedupStage2Consumer includes in its input:

  1. All PendingDedupCheck studies from the new import.
  2. Existing Active studies that lack a publicationId (no authoritative DOI/PMID). Studies that already have a publicationId do not need to be included — if the new record were a duplicate of one, Stage 1 would have caught it via the shared Publication's DOI/PMID.

This means Stage 2 operates on a much smaller existing-study set than a naive "all Active studies" approach: only studies that previously had no identifier are candidates for fuzzy-match duplication with the new batch.
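
The Stage 2 input rule above reduces to a two-clause predicate. The C# sketch below shows one way to express it; the Study record and Stage2InputSelector class are hypothetical, and the caller is assumed to pass in the project's studies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum LifecycleStatus { Active, PendingDedupCheck, PendingDuplicateReview, Duplicate, Merged }

// Hypothetical minimal Study shape for illustration.
public sealed record Study(Guid Id, LifecycleStatus LifecycleStatus, Guid? PublicationId);

public static class Stage2InputSelector
{
    // Keeps PendingDedupCheck studies plus Active studies that have no
    // publicationId; Active studies with a publicationId are left out
    // because Stage 1 already covers them via DOI/PMID.
    public static List<Study> Select(IEnumerable<Study> projectStudies) =>
        projectStudies
            .Where(s =>
                s.LifecycleStatus == LifecycleStatus.PendingDedupCheck ||
                (s.LifecycleStatus == LifecycleStatus.Active && s.PublicationId is null))
            .ToList();
}
```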

Stage 2 frequency. For MVP, Stage 2 is triggered immediately after all Stage 1 batches for a systematic search complete. A future optimisation MAY coalesce multiple concurrent imports into a single Stage 2 run if several systematic searches complete within a short time window.

4. Confidence Model

4.1 Two-Tier Classification

The MatchClassifier component (Section 3.3) classifies candidate pairs into two tiers:

  • AutoConfirmed: High-confidence duplicates identified by ASySD's internal heuristic filters. These are pairs where the match evidence is strong enough for automatic resolution. For studies with no review data, these are auto-merged. For studies with review data, these are flagged for admin confirmation (see Section 7, Bibliographic Consolidation Rules).

  • ProbableDuplicate: Possible matches that ASySD flags for manual review. The match evidence is suggestive but insufficient for automatic resolution. These are ALWAYS queued for admin review regardless of review data status.

4.2 Threshold Configuration

The MatchClassifier's 25 boolean threshold rules are NOT exposed as user-configurable parameters. The two-tier output (AutoConfirmed vs. ProbableDuplicate) provides sufficient admin oversight for MVP.

Future iteration: If project administrators request sensitivity tuning, a configurable parameter MAY be added that shifts pairs between AutoConfirmed and ProbableDuplicate tiers by adjusting the threshold rules in MatchClassifier, without changing the blocking or scoring logic.

5. Service Interface Contract

Phase 12 MUST implement the following types. These types define the boundary between the dedup service and the import saga / admin review queue.

5.1 Input Types

Stage1DedupRequest {
    ProjectId: Guid
    SystematicSearchId: Guid
    BatchId: Guid                    // References a pmDedupBatch staging document
    RecordCount: int                 // Count of records in batch (for progress tracking only)
    TotalBatchCount: int             // Total number of Stage 1 batches for this search
    BatchIndex: int                  // 0-based index of this batch (saga uses to detect completion)
}

Stage2DedupRequest {
    ProjectId: Guid
    SystematicSearchId: Guid
    IsRetroactive: bool              // true for admin-triggered retroactive dedup
}

DedupBatch {                         // Stored in pmDedupBatch staging collection
    _id: Guid                        // = BatchId
    ProjectId: Guid
    SystematicSearchId: Guid
    Records: List<CitationForDedup>
    CreatedAt: DateTime
    ExpiresAt: DateTime              // TTL index (7 days) for orphan cleanup
}

CitationForDedup {
    CitationId: Guid
    Title: string
    Authors: string
    Year: string
    Journal: string
    Doi: string
    Pages: string
    Volume: string
    Issue: string
    Abstract: string
    Isbn: string
    SourceName: string
    SourceType: string
}

5.2 Output Types

DeduplicationResult {
    AutoConfirmedGroups: List<DuplicateGroup>
    ProbableGroups: List<DuplicateGroup>
    UniqueRecordIds: List<Guid>
    ProcessingTimeMs: long
    AlgorithmVersion: string  // SyRF dedup algorithm version (e.g., "1.0.0"); bumped when matching rules change
}

DuplicateGroup {
    GroupId: Guid
    CanonicalRecordId: Guid           // Best record chosen by retention rules
    MemberRecordIds: List<Guid>       // All records in the group (including canonical)
    Confidence: DuplicateConfidence   // AutoConfirmed | ProbableDuplicate
    MatchDetails: MatchSummary
}

MatchSummary {
    MatchingFields: List<FieldMatch>  // Which fields matched and similarity scores
    BlockingRound: int                // Which round formed the candidate pair (1-4)
}

FieldMatch {
    FieldName: string                 // e.g., "title", "author", "doi"
    SimilarityScore: double           // 0.0-1.0 Jaro-Winkler similarity
}

enum DuplicateConfidence {
    AutoConfirmed = 0,
    ProbableDuplicate = 1
}

5.3 Service Abstraction

interface IDeduplicationService {
    Task<DeduplicationResult> DeduplicateAsync(
        DeduplicationRequest request,
        CancellationToken cancellationToken
    );
}

This interface abstracts the deduplication implementation. The C# implementation is the only supported mechanism. Future algorithmic improvements (e.g., updated blocking rules, alternative similarity metrics) require only changing the implementation behind this interface, not any consumers or saga logic.

6. Canonical Enrichment Rules

When multiple Citations are confirmed as duplicates and linked to the same Publication, the canonical metadata fields on the Publication SHALL be populated using best-of-breed selection. These rules align with the merge_citations = TRUE behavior documented in Hair et al. (2023).

6.1 Field-by-Field Selection Rules

| Field | Selection Rule | Rationale |
|---|---|---|
| canonicalTitle | Prefer longest non-empty title | Longer titles may include subtitles; more complete information |
| canonicalAuthors | Prefer longest author list (by author count) | More complete authorship record |
| canonicalAbstract | Prefer longest non-empty abstract | More complete information for screening |
| canonicalYear | Prefer explicit 4-digit year | Avoid parsed or estimated years from inconsistent sources |
| canonicalJournal | Prefer full journal name over abbreviation (longer string) | Full names are more readable and less ambiguous |
| canonicalPages | Prefer complete page range (contains "-") | Full page span is more informative than a single page number |
| canonicalVolume | Prefer non-empty value | Any volume data is better than none; no further ranking |
| canonicalIssue | Prefer non-empty value | Any issue data is better than none; no further ranking |
| canonicalIsbn | Prefer non-empty value | Any ISBN/ISSN is better than none; no further ranking |
| doi | Prefer non-empty DOI; if multiple non-empty, prefer PubMed-sourced DOI | PubMed DOIs are the most reliably formatted and verified |
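
Three of these selection rules are sketched in C# below: longest non-empty value, explicit 4-digit year, and complete page range. The class and method names are hypothetical, and picking the first candidate on a tie is an assumption. Author-count and PubMed-preference rules are left out because they depend on parsing and source metadata not shown here.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CanonicalSelectionSketch
{
    // Longest non-empty value (title, abstract, journal); null when none.
    // OrderByDescending is stable, so ties keep the earliest candidate.
    public static string? Longest(IEnumerable<string?> values) =>
        values.Where(v => !string.IsNullOrWhiteSpace(v))
              .OrderByDescending(v => v!.Length)
              .FirstOrDefault();

    // First value that is exactly four ASCII digits (canonicalYear).
    public static string? ExplicitYear(IEnumerable<string?> values) =>
        values.FirstOrDefault(v => v is not null && Regex.IsMatch(v, "^[0-9]{4}$"));

    // Prefers a page range containing "-", else any non-empty value.
    public static string? Pages(IEnumerable<string?> values)
    {
        var nonEmpty = values.Where(v => !string.IsNullOrWhiteSpace(v)).ToList();
        return nonEmpty.FirstOrDefault(v => v!.Contains('-')) ?? nonEmpty.FirstOrDefault();
    }
}
```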

6.2 Provenance Tracking

Each canonical field on the Publication MUST record which Citation (and therefore which search/source) provided it. This enables audit trail and cross-project attribution.

MetadataProvenance {
    FieldName: string              // Name of the canonical field (e.g., "canonicalTitle")
    SourceCitationId: Guid         // Citation that provided this field value
    SourceProjectId: Guid          // Project that owns the source Citation
    SelectedAt: DateTime           // When this field was selected/updated
}

The Publication.metadataProvenance[] array (see three-level-data-model.md Section 2) MUST contain one entry per canonical field that has been populated.

6.3 Cross-Project Enrichment

When a new Citation from Project B links to a Publication already known from Project A, the canonical fields on the Publication MAY be updated if Project B's Citation has better metadata per the selection rules in Section 6.1.

Rules:

  1. The enrichment update is automatic (no admin approval required).
  2. The MetadataProvenance entry for the updated field MUST be updated to reflect the new source.
  3. The previous provenance is overwritten (not preserved as history). The DedupAuditLog captures the change event.
  4. Project A's view of the Publication is immediately enriched by Project B's data (and vice versa). This is an intentional design choice: Publications accumulate the best metadata from all projects.

7. Bibliographic Consolidation Rules for Already-Reviewed Duplicates

These five scenarios define the exact behavioral rules for handling duplicates based on their review state. The scenarios are mutually exclusive and exhaustive for all duplicate pair cases.

7.1 Scenario Table

| # | Scenario | Automation Level | Action | Study.lifecycleStatus |
|---|---|---|---|---|
| 1 | High-confidence, neither study reviewed | Auto-confirm | Merge into single Study: retain all Citations on canonical Study, mark secondary Study as Duplicate (not deleted) | Canonical: Active; secondary: Duplicate |
| 2 | High-confidence, one study reviewed | Auto-confirm | Keep reviewed Study as canonical, link Citations from unreviewed Study to canonical | Canonical: unchanged; unreviewed: Duplicate |
| 3 | High-confidence, both studies reviewed (same stage) | Admin review required | If admin confirms: secondary's annotation/screening sessions become additional candidate sessions for reconciliation on canonical Study | Canonical: unchanged; secondary: Merged |
| 4 | High-confidence, both studies reviewed (different stages) | Admin review required | If admin confirms: link both Studies via shared Publication but do NOT merge review data (contextually different stage data) | Both: unchanged; linked via Publication (same publicationId) |
| 5 | Probable duplicate, any review state | Admin review always | Queue for admin decision; admin can confirm (triggers applicable rule from scenarios 1-4) or reject (mark as not duplicate) | PendingDuplicateReview until resolved |
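
The scenario table can be read as a small decision function. The C# sketch below maps a pair's confidence tier and review state to a scenario number. ScenarioResolver is a hypothetical name, stage identifiers are plain strings for brevity, and treating any shared stage as "same stage" is an assumption, because the table does not define partial overlap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum DuplicateConfidence { AutoConfirmed = 0, ProbableDuplicate = 1 }

public static class ScenarioResolver
{
    // Each collection lists the stages in which that study has review data.
    public static int Resolve(
        DuplicateConfidence confidence,
        IReadOnlyCollection<string> stagesReviewedA,
        IReadOnlyCollection<string> stagesReviewedB)
    {
        if (confidence == DuplicateConfidence.ProbableDuplicate) return 5;

        var aReviewed = stagesReviewedA.Count > 0;
        var bReviewed = stagesReviewedB.Count > 0;

        if (!aReviewed && !bReviewed) return 1;
        if (aReviewed != bReviewed) return 2;

        // Both reviewed. Assumption: any shared stage counts as "same stage".
        return stagesReviewedA.Intersect(stagesReviewedB).Any() ? 3 : 4;
    }

    // Scenarios 3, 4, and 5 go to the admin review queue.
    public static bool RequiresAdminReview(int scenario) => scenario >= 3;
}
```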

7.2 Scenario Details

Scenario 1: High-confidence, neither reviewed

Both studies have no screening decisions and no annotation sessions. The dedup service auto-confirms the merge:

  1. Select the canonical Study using the DuplicateGrouper retention priority (prefer abstract-bearing record; among ties, prefer more recently imported).
  2. Move all Citations from the secondary Study to the canonical Study's citations[] array.
  3. Set secondary Study's lifecycleStatus = Duplicate.
  4. Set secondary Study's duplicateGroupId to the group identifier.
  5. Update the Publication's canonical metadata using enrichment rules (Section 6.1).
  6. Log to DedupAuditLog with Decision = Confirmed, DecidedBy = System.

Scenario 2: High-confidence, one reviewed

One study has review data; the other does not. The reviewed study MUST be the canonical:

  1. The reviewed Study becomes canonical (regardless of ASySD retention priority).
  2. Move all Citations from the unreviewed Study to the canonical Study's citations[] array.
  3. Set unreviewed Study's lifecycleStatus = Duplicate.
  4. Update the Publication's canonical metadata using enrichment rules (Section 6.1).
  5. Log to DedupAuditLog with Decision = Confirmed, DecidedBy = System.

Scenario 3: High-confidence, both reviewed (same stage)

Both studies have review data in the same stage. Admin MUST decide:

  1. Queue for admin review (see Section 8, Admin Review Queue).
  2. If admin confirms: the canonical Study is the one with more review data (or admin choice).
  3. The secondary Study's annotation sessions and screening decisions become additional candidate sessions for reconciliation on the canonical Study.
  4. Set secondary Study's lifecycleStatus = Merged.
  5. The secondary Study retains all its data but is excluded from stage pools.
  6. Log to DedupAuditLog with Decision = Confirmed, DecidedBy = AdminId.

Key detail for Scenario 3: "Additional candidate sessions for reconciliation" means the secondary study's annotation sessions are treated as if additional annotators had annotated the canonical study. This increases the number of candidate sessions available for reconciliation, potentially improving agreement assessment.

Scenario 4: High-confidence, both reviewed (different stages)

Both studies have review data but in different stages. The review data is contextually different and MUST NOT be merged:

  1. Queue for admin review (see Section 8, Admin Review Queue).
  2. If admin confirms: link both Studies via the same Publication (publicationId points to the same Publication).
  3. Do NOT merge review data, move annotation sessions, or change review state.
  4. Both Studies retain their current lifecycleStatus (no change).
  5. The duplicate is resolved at the bibliographic level (shared Publication) but not at the review level.
  6. Log to DedupAuditLog with Decision = Confirmed, DecidedBy = AdminId.

Scenario 5: Probable duplicate, any state

ASySD classified the pair as probable (not high-confidence). Regardless of review state:

  1. Set both Studies' lifecycleStatus = PendingDuplicateReview.
  2. Queue for admin review (see Section 8, Admin Review Queue).
  3. Admin decides:
     • Confirm Duplicate: Apply the appropriate rule from scenarios 1-4 based on review state.
     • Not Duplicate: Dismiss the pair; restore both Studies' lifecycleStatus to their previous values; log rejection.
     • Defer: Keep in queue for later review.
  4. Log to DedupAuditLog with the admin's decision.

7.3 Key Invariant

NEVER auto-merge studies that have review data. Admin ALWAYS decides for reviewed studies. This is a system invariant that MUST be enforced at the service level, not just at the UI level.

8. Admin Review Queue

8.1 Queue Population

The admin review queue is populated from two sources:

  1. Probable duplicates (Scenario 5): All probable duplicate pairs from ASySD output.
  2. High-confidence duplicates with review data (Scenarios 3, 4): Auto-confirmed duplicates where both studies have review data.

8.2 Queue Item Display

Each queue item MUST show:

| Element | Description |
|---|---|
| Candidate pair | Both studies with their bibliographic metadata side-by-side |
| Match details | Fields that matched, Jaro-Winkler similarity scores per field, blocking round |
| Review data summary | Which stages each study has been reviewed in, number of screening decisions, number of annotation sessions |
| Confidence tier | AutoConfirmed or ProbableDuplicate |
| Import source | Which systematic search imported each study |

8.3 Admin Actions

| Action | Effect | Applicable To |
|---|---|---|
| Confirm Duplicate | Triggers merge per applicable scenario (1-4); removes pair from queue | All queue items |
| Not Duplicate | Dismisses the pair; restores lifecycleStatus to previous value; pair not re-queued | All queue items |
| Defer | Keeps pair in queue for later review; no status change | All queue items |

All admin actions MUST be logged in the DedupAuditLog with the admin's investigator ID and timestamp.

9. Merge Wizard for Reviewed Duplicates

When an admin confirms a duplicate pair where BOTH studies have review data (Scenarios 3 and 4), a merge wizard SHALL guide the resolution.

9.1 Same-Stage Merge (Scenario 3)

The merge wizard presents:

  1. Both studies' metadata side-by-side (title, authors, year, journal).
  2. Both studies' annotation sessions for the shared stage -- showing session count, annotator names (anonymized if blinding is active), and completion status.
  3. Merge preview: Shows the canonical study after merge, with the secondary study's sessions listed as "additional candidate sessions for reconciliation."
  4. Confirmation button: Admin confirms; secondary study's sessions become additional candidates; secondary study set to Merged.

9.2 Different-Stage Link (Scenario 4)

The merge wizard presents:

  1. Both studies' metadata side-by-side.
  2. Stage breakdown: Shows which stages each study has review data in, emphasizing that the stages are different.
  3. Link preview: Shows that both studies will share the same Publication but retain separate review data.
  4. Confirmation button: Admin confirms; studies linked via Publication; no review data moved.

9.3 Post-Merge State

In both cases after admin confirmation:

  • The secondary study's lifecycleStatus is set to Merged (Scenario 3) or remains unchanged (Scenario 4).
  • The secondary study retains ALL its data (Citations, annotation sessions, screening decisions) but is excluded from stage study pools.
  • The canonical study gains the secondary study's Citations (Scenario 3) or is linked via Publication (Scenario 4).
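
The post-merge rules above can be sketched as a single function that branches on the scenario. The `Study` record and `apply_post_merge_state` are hypothetical names. The sketch assumes that "gains the secondary study's Citations" means the canonical study receives references to them while the secondary keeps its own, and that linking via Publication can be modelled as both studies carrying the same `publication_id`.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    # Hypothetical, simplified study record for illustration only.
    study_id: str
    lifecycle_status: str
    publication_id: str
    citations: list = field(default_factory=list)


def apply_post_merge_state(scenario, canonical, secondary):
    """Set the post-confirmation state for Scenario 3 or Scenario 4."""
    if scenario == 3:
        # Same-stage merge: the canonical study gains the secondary's
        # Citations; the secondary keeps its data and is marked Merged.
        canonical.citations.extend(secondary.citations)
        secondary.lifecycle_status = "Merged"
    elif scenario == 4:
        # Cross-stage: link through the shared Publication; status and
        # review data stay where they are.
        secondary.publication_id = canonical.publication_id
    else:
        raise ValueError("the merge wizard only handles Scenarios 3 and 4")
    return canonical, secondary
```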

10. Dedup Audit Log

Every deduplication decision (automatic or admin-reviewed) SHALL create an audit record. The audit log provides traceability and enables reversal of incorrect dedup decisions.

10.1 Audit Entry Structure

DedupAuditEntry {
    EntryId: Guid
    ProjectId: Guid
    RecordIdA: Guid                    // First study/Citation in the pair
    RecordIdB: Guid                    // Second study/Citation in the pair
    Confidence: DuplicateConfidence    // AutoConfirmed | ProbableDuplicate
    Decision: DedupDecision            // Confirmed | Rejected | Pending
    DecidedBy: DedupDecider            // System (for auto-confirmed) | Admin(InvestigatorId: Guid)
    DecidedAt: DateTime                // When the decision was made
    MatchDetails: MatchSummary         // Fields matched, similarity scores, blocking round
    AlgorithmVersion: string           // SyRF dedup algorithm version (e.g., "1.0.0"); bumped when matching rules change
    Notes: string?                     // Optional admin notes (e.g., reason for rejection)
}

enum DedupDecision {
    Confirmed = 0,      // Duplicate confirmed (auto or admin)
    Rejected = 1,        // Not a duplicate (admin decision)
    Pending = 2          // Awaiting admin review
}

// DedupDecider is a discriminated union:
// - System: automatic decision by the dedup service
// - Admin(InvestigatorId: Guid): manual decision by a project admin
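
For readers who want to see the discriminated union in concrete form, the structure above might be modelled as follows. This is a sketch in Python, not the C# types SyRF will define: the snake_case field names, the use of a plain `dict` in place of `MatchSummary` (whose shape is not given in this section), and the `describe_decider` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional, Union
from uuid import UUID, uuid4


class DuplicateConfidence(Enum):
    AUTO_CONFIRMED = "AutoConfirmed"
    PROBABLE_DUPLICATE = "ProbableDuplicate"


class DedupDecision(Enum):
    CONFIRMED = 0
    REJECTED = 1
    PENDING = 2


@dataclass(frozen=True)
class System:
    """Decider variant: automatic decision by the dedup service."""


@dataclass(frozen=True)
class Admin:
    """Decider variant: manual decision by a project admin."""
    investigator_id: UUID


DedupDecider = Union[System, Admin]


@dataclass
class DedupAuditEntry:
    project_id: UUID
    record_id_a: UUID
    record_id_b: UUID
    confidence: DuplicateConfidence
    decision: DedupDecision
    decided_by: DedupDecider
    match_details: dict  # stand-in for MatchSummary (shape assumed)
    algorithm_version: str
    notes: Optional[str] = None
    entry_id: UUID = field(default_factory=uuid4)
    decided_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def describe_decider(decider: DedupDecider) -> str:
    """Branch on the union variant, much as a pattern match would."""
    if isinstance(decider, Admin):
        return f"admin:{decider.investigator_id}"
    return "system"
```

Modelling `System` and `Admin` as separate types means an admin decision cannot be recorded without an investigator ID, which is what Section 8.3 requires of the log.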

10.2 Audit Log Storage

The DedupAuditLog SHOULD be stored as an array on the project document (embedded) or in a separate audit collection, depending on expected volume. For MVP, embedding on the project document is recommended if the expected number of dedup decisions per project is < 10,000. For larger projects, a separate pmDedupAuditLog collection MAY be created.

Decision deferred to: Phase 12 implementation (storage location).

10.3 Retention

Audit entries SHALL NEVER be deleted. They provide the audit trail required for reproducibility of systematic review methodology.

11. PRISMA Box 3 Integration

The dedup service produces the data required for PRISMA box 3 ("Records removed before screening"). See prisma-flow-diagram-mapping.md Section 3.1, Box 3.

11.1 Derivation Rules

| PRISMA Field | Derivation | Data Source |
|---|---|---|
| duplicates | COUNT(citations across all studies in project) - COUNT(distinct active Studies after dedup) | Study.citations[] count vs. Study count WHERE lifecycleStatus NOT IN (Duplicate, Merged) |
| excluded_automatic | COUNT(studies WHERE lifecycleStatus = RemovedByAutomation) | Study.lifecycleStatus |
| excluded_other | COUNT(studies WHERE lifecycleStatus = RemovedOther) | Study.lifecycleStatus |
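
As an illustration, the three derivations can be computed from an in-memory list of studies. The sketch assumes each study is a dictionary with `lifecycleStatus` and `citations` keys. That shape and the function name `prisma_box3_counts` are hypothetical, and a real implementation would presumably run these counts as database queries.

```python
def prisma_box3_counts(studies):
    """Derive the three PRISMA box 3 fields from study records.

    Each study is a dict with "lifecycleStatus" and "citations" keys;
    this dict shape is an assumption made for the example.
    """
    total_citations = sum(len(s["citations"]) for s in studies)
    non_duplicate_studies = sum(
        1 for s in studies
        if s["lifecycleStatus"] not in ("Duplicate", "Merged")
    )

    def count_status(status):
        return sum(1 for s in studies if s["lifecycleStatus"] == status)

    return {
        # Citation total minus studies that are not Duplicate or Merged.
        "duplicates": total_citations - non_duplicate_studies,
        "excluded_automatic": count_status("RemovedByAutomation"),
        "excluded_other": count_status("RemovedOther"),
    }


studies = [
    {"lifecycleStatus": "Active", "citations": ["c1"]},
    {"lifecycleStatus": "Active", "citations": ["c2"]},
    {"lifecycleStatus": "Duplicate", "citations": ["c3"]},
    {"lifecycleStatus": "RemovedByAutomation", "citations": ["c4"]},
]
counts = prisma_box3_counts(studies)
```

With this sample data there are four citations and three studies outside Duplicate or Merged, so `duplicates` is 1, `excluded_automatic` is 1, and `excluded_other` is 0.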

11.2 Count Consistency

The following equation MUST hold for any project at any point in time:

total_import_records = unique_active_studies + duplicates_removed + excluded_automatic + excluded_other + pending_review + pending_dedup

Where:

  • total_import_records = SUM(COUNT(study.citations[])) for all studies in project
  • unique_active_studies = COUNT(studies WHERE lifecycleStatus NOT IN (Duplicate, Merged, RemovedByAutomation, RemovedOther, PendingDuplicateReview, PendingDedupCheck))
  • duplicates_removed = COUNT(studies WHERE lifecycleStatus IN (Duplicate, Merged))
  • excluded_automatic = COUNT(studies WHERE lifecycleStatus = RemovedByAutomation)
  • excluded_other = COUNT(studies WHERE lifecycleStatus = RemovedOther)
  • pending_review = COUNT(studies WHERE lifecycleStatus = PendingDuplicateReview)
  • pending_dedup = COUNT(studies WHERE lifecycleStatus = PendingDedupCheck)
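
A consistency check of this equation could be written as a small assertion helper, for example in an integration test. The sketch below assumes the same hypothetical dictionary shape for studies (`lifecycleStatus` and `citations` keys), and `check_count_consistency` is an illustrative name rather than part of the service interface.

```python
from collections import Counter


def check_count_consistency(studies):
    """Return (holds, left_side, right_side) for the Section 11.2 equation."""
    by_status = Counter(s["lifecycleStatus"] for s in studies)

    total_import_records = sum(len(s["citations"]) for s in studies)

    duplicates_removed = by_status["Duplicate"] + by_status["Merged"]
    excluded_automatic = by_status["RemovedByAutomation"]
    excluded_other = by_status["RemovedOther"]
    pending_review = by_status["PendingDuplicateReview"]
    pending_dedup = by_status["PendingDedupCheck"]
    # Everything not in one of the six statuses above counts as active.
    unique_active_studies = len(studies) - (
        duplicates_removed + excluded_automatic + excluded_other
        + pending_review + pending_dedup
    )

    right_side = (
        unique_active_studies + duplicates_removed + excluded_automatic
        + excluded_other + pending_review + pending_dedup
    )
    return total_import_records == right_side, total_import_records, right_side
```

Because the terms on the right-hand side partition the project's studies by status, their sum always equals the total study count. In this sketch the check therefore reduces to comparing the citation total with the number of studies, which is worth keeping in mind when deciding how Citations are counted after a merge.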

12. Duplicate Study Exclusion from Stage Pools

System Invariant

Studies with lifecycleStatus IN (Duplicate, Merged) MUST NOT appear in any stage study pool. This invariant ensures that:

  1. Screeners never screen a study that has already been identified as a duplicate.
  2. Annotators never annotate a study that has been merged into another.
  3. PRISMA counts reflect only unique, active studies in the screening pipeline.

Enforcement

This invariant SHALL be enforced at the query/filter level, not by deletion:

  • All stage study pool queries MUST include a filter: lifecycleStatus NOT IN (Duplicate, Merged, PendingDuplicateReview, PendingDedupCheck, RemovedByAutomation, RemovedOther)
  • PendingDedupCheck excludes studies from pools until Stage 2 fuzzy matching resolves their duplicate status.
  • PendingDuplicateReview excludes studies from pools until the admin resolves the duplicate decision.
  • Studies with these statuses remain in the database with all their data intact, enabling audit, reversal, and PRISMA counting.
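
The filter-not-delete approach can be illustrated with a shared constant that every pool query reuses. The names `EXCLUDED_FROM_POOLS`, `stage_pool`, and `pool_filter_document` are hypothetical. The second function additionally assumes a MongoDB-style store, where `$nin` is the standard "not in" query operator. This section does not confirm the storage engine.

```python
# The six statuses listed in the enforcement rule above.
EXCLUDED_FROM_POOLS = frozenset({
    "Duplicate",
    "Merged",
    "PendingDuplicateReview",
    "PendingDedupCheck",
    "RemovedByAutomation",
    "RemovedOther",
})


def stage_pool(studies):
    """Return the studies eligible for a stage pool; nothing is deleted."""
    return [
        s for s in studies
        if s["lifecycleStatus"] not in EXCLUDED_FROM_POOLS
    ]


def pool_filter_document():
    """Equivalent filter for a MongoDB-style query (assumed backend)."""
    return {"lifecycleStatus": {"$nin": sorted(EXCLUDED_FROM_POOLS)}}


all_studies = [
    {"id": "s1", "lifecycleStatus": "Active"},
    {"id": "s2", "lifecycleStatus": "Merged"},
    {"id": "s3", "lifecycleStatus": "PendingDedupCheck"},
]
pool = stage_pool(all_studies)  # only s1 remains; all_studies is untouched
```

Keeping the excluded statuses in one place means a newly added lifecycle status only has to be classified once, rather than in every pool query.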

Reversal

If an admin later rejects a duplicate decision (via the admin review queue):

  1. The study's lifecycleStatus SHALL be restored to its previous value (typically Active).
  2. The study SHALL re-enter applicable stage study pools.
  3. The DedupAuditLog records the rejection.

13. Cross-References

Requirement Coverage

| Requirement ID | Coverage in This Document |
|---|---|
| DEDUP-01 | Complete: ASySD algorithm specified with blocking rounds, field mappings, C# implementation components, two-stage pipeline architecture (Stage 1 synchronous DOI/PMID, Stage 2 async fuzzy matching via MassTransit) |
| DEDUP-02 | Complete: Two-tier confidence model (AutoConfirmed / ProbableDuplicate) with import-time triggering |
| DEDUP-03 | Complete: Import pipeline creates Citation per citation, links to Publication; field mapping defined |
| DEDUP-04 | Complete: Study lifecycle status model covers Duplicate, PendingDuplicateReview, Merged statuses |
| DEDUP-05 | Complete: Canonical enrichment rules with field-by-field priority table and provenance tracking |
| DEDUP-06 | Complete: Admin review queue with display requirements, admin actions (Confirm, Reject, Defer) |
| DEDUP-07 | Complete: Merge wizard for same-stage (sessions become reconciliation candidates) and cross-stage (link via Publication) scenarios |
| DEDUP-08 | Complete: Duplicate study exclusion from stage pools as system invariant, enforced at query level |
| PRISMA-05 | Complete: PRISMA box 3 derivation rules for duplicates, excluded_automatic, and excluded_other |