Deduplication Service Specification
Purpose
This document specifies the deduplication service interface, ASySD algorithm integration approach, confidence model, canonical enrichment rules, and bibliographic consolidation rules for already-reviewed duplicate pairs. It is a binding constraint on Phase 12 implementation.
The dedup service is responsible for detecting and resolving duplicate citations within a project and across projects via the Publication entity. It produces the data required for PRISMA box 3 (duplicates removed, excluded by automation, excluded other) and enforces the system invariant that duplicate/merged studies never appear in stage study pools.
Normative language: "MUST" indicates an absolute requirement. "SHALL" indicates mandatory behavior. "SHOULD" indicates a strong recommendation. "MAY" indicates optional behavior.
Algorithm reference: ASySD -- Hair et al. (2023), BMC Biology 21, 189; ASySD R package (github.com/camaradesuk/ASySD). SyRF implements this algorithm natively in C# (see Section 3). CAMARADES (the ASySD maintainer) and SyRF share the same organisation, so algorithm improvements flow in both directions.
1. Service Overview
The deduplication service detects and resolves duplicate citations within a project and across projects via the Publication entity (see three-level-data-model.md Section 2).
Entry Points
The service has two entry points:
- Import-time deduplication: Triggered automatically when new citations are imported via a systematic search. Runs as a two-stage pipeline:
  - Stage 1 (synchronous): DOI/PMID exact match against pmPublication. Completes as part of the import operation. Matched records are resolved immediately (lifecycleStatus = Active). Unmatched records are created with lifecycleStatus = PendingDedupCheck, which keeps them visible to the admin but excluded from stage pools. The import saga completes here; citations are visible to the user immediately.
  - Stage 2 (asynchronous): Full fuzzy matching (blocking, Jaro-Winkler scoring, classification, grouping) runs as a background job after Stage 1 completes. Triggered once per import event, not per batch. Resolves PendingDedupCheck studies to Active, Duplicate, or PendingDuplicateReview.
- Retroactive deduplication: Triggered by a project administrator to review existing studies for duplicates that were missed at import time (e.g., if a project predates the dedup service deployment). Runs the full fuzzy matching pipeline across all active studies in the project.
Invariants
- The service SHALL NEVER auto-merge studies that have review data (screening decisions or annotation sessions). Admin MUST always decide for reviewed studies.
- All deduplication decisions (automatic or admin-reviewed) SHALL be auditable -- logged with confidence level, decision source, timestamp, and match details.
- Citations SHALL NEVER be deleted or modified by the dedup service. The original import data MUST be preserved to enable per-source PRISMA counting.
- Studies with lifecycleStatus IN (Duplicate, Merged, PendingDedupCheck) MUST NOT appear in any stage study pool. This is enforced at the query/filter level, not by deletion.
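The last invariant can be illustrated with a minimal query-level filter. This is a sketch only: the status names come from this specification, while the LifecycleStatus enum, StudyRow record, and StagePoolFilter class are hypothetical stand-ins for the real entities.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Status names are taken from this specification; the enum itself is a hypothetical stand-in.
public enum LifecycleStatus
{
    Active,
    PendingDedupCheck,
    PendingDuplicateReview,
    Duplicate,
    Merged
}

// Hypothetical projection of a Study, reduced to what the filter needs.
public sealed record StudyRow(Guid Id, LifecycleStatus Status);

public static class StagePoolFilter
{
    // Statuses that the invariant excludes from every stage study pool.
    private static readonly HashSet<LifecycleStatus> Excluded = new HashSet<LifecycleStatus>
    {
        LifecycleStatus.Duplicate,
        LifecycleStatus.Merged,
        LifecycleStatus.PendingDedupCheck
    };

    // Filters at query time; nothing is deleted.
    public static IEnumerable<StudyRow> EligibleForPool(IEnumerable<StudyRow> studies) =>
        studies.Where(s => !Excluded.Contains(s.Status));
}
```

In a real repository the same predicate would presumably be pushed into the database query rather than applied in memory.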
2. ASySD Algorithm Specification
2.1 Algorithm Summary
ASySD (Automated Systematic Search Deduplication) uses a 4-round blocking strategy with Jaro-Winkler string comparison across 10 bibliographic fields. It produces a two-tier output: automatically confirmed duplicates (high confidence) and probable duplicates (requiring manual review).
Performance characteristics:
| Dataset Size | Processing Time | Sensitivity | Specificity |
|---|---|---|---|
| 1,845 citations | < 1 min | 0.95--0.998 | >0.999 |
| 79,880 citations | < 1 hour | 0.95--0.998 | >0.999 |
2.2 Blocking Rounds
Each round uses different field combinations as blocking criteria. Fields in a blocking criterion MUST match 100% (exact match after normalization) to form a candidate pair for string comparison.
| Round | Blocking Criteria (OR-combined within round) |
|---|---|
| 1 | (Title AND Pages) OR (Title AND Author) OR (Title AND Abstract) OR DOI |
| 2 | (Author AND Year AND Pages) OR (Journal AND Volume AND Pages) OR (ISBN AND Volume AND Pages) OR (Title AND ISBN) |
| 3 | (Year AND Pages AND Volume) OR (Year AND Issue AND Volume) OR (Year AND Pages AND Issue) |
| 4 | (Author AND Year) OR (Title AND Year) OR (Title AND Volume) OR (Title AND Journal) |
Preprocessing: Before blocking, all fields are normalized:
- Missing/anonymous authors renamed to "Unknown"
- DOI format harmonized (lowercase, prefix removed)
- Punctuation removed
- All text converted to uppercase
Candidate pair formation: A pair of records is formed if they match on ANY blocking criterion in ANY round. Each round is progressively less restrictive.
String comparison: For each candidate pair, Jaro-Winkler similarity is computed across all 10 fields (Title, Author, Year, Journal, ISBN, Abstract, DOI, Issue, Pages, Volume) by the C# JaroWinklerScorer (see Section 3.3).
Graph-based grouping: Duplicate groups are assigned via transitive closure -- if A=B and B=C, then A, B, and C form one group (generate_dup_id() function).
2.3 Required Input Fields
The following fields are the inputs to ASySD processing; the Required column marks which are mandatory and which are optional. Each maps to a specific SyRF entity field.
| # | ASySD Field | Required | Used In | SyRF Source Entity | SyRF Field |
|---|---|---|---|---|---|
| 1 | record_id | Auto-generated if absent | Unique identifier | Citation | _id |
| 2 | author | Yes | Blocking rounds 1, 2, 4; string comparison | Citation | rawAuthors |
| 3 | year | Yes | Blocking rounds 2, 3, 4; string comparison | Citation | rawYear |
| 4 | journal | Yes | Blocking rounds 2, 4; string comparison | Citation | rawJournal |
| 5 | doi | Yes | Blocking round 1; string comparison | Citation | rawDoi |
| 6 | title | Yes | Blocking rounds 1, 2, 4; string comparison | Citation | rawTitle |
| 7 | pages | Yes | Blocking rounds 1, 2, 3; string comparison | Citation | rawPages |
| 8 | volume | Yes | Blocking rounds 2, 3, 4; string comparison | Citation | rawVolume |
| 9 | number (issue) | Yes | Blocking round 3; string comparison | Citation | rawIssue |
| 10 | abstract | No (improves results) | Blocking round 1; string comparison | Citation | rawAbstract |
| 11 | isbn | No (improves results) | Blocking round 2; string comparison | Citation | rawIsbn |
| 12 | label | No (optional) | User tagging; retention priority | SystematicSearch | sourceType.ToString() |
| 13 | source | No (optional) | Database origin tracking | SystematicSearch | sourceName |
2.4 Field Mapping (ASySD to SyRF)
This table defines the exact mapping between ASySD input fields and SyRF entity fields. Phase 12 implementation MUST use these mappings when constructing the input for the C# deduplication algorithm.
| ASySD Field | SyRF Entity.Field | Notes |
|---|---|---|
| record_id | Citation._id | Guid serialized as string |
| title | Citation.rawTitle | Raw title as imported |
| author | Citation.rawAuthors | Raw author string as imported |
| year | Citation.rawYear | String format preserved |
| journal | Citation.rawJournal | Raw journal name as imported |
| doi | Citation.rawDoi | Raw DOI as imported |
| pages | Citation.rawPages | Raw page range as imported |
| volume | Citation.rawVolume | Raw volume as imported |
| number | Citation.rawIssue | ASySD uses "number" for what SyRF calls "issue" |
| abstract | Citation.rawAbstract | Raw abstract as imported |
| isbn | Citation.rawIsbn | Raw ISBN/ISSN as imported |
| source | SystematicSearch.sourceName | Name of the search source (e.g., "PubMed") |
| label | SystematicSearch.sourceType.ToString() | Source type enum as string (e.g., "Database") |
3. Integration Architecture
3.1 Native C# Implementation via MassTransit Consumer
The import pipeline already uses MassTransit saga orchestration for study parsing. The dedup service is integrated as a two-stage step after citation parsing and Citation creation:
- DedupStage1Consumer: Handles DOI/PMID exact matching synchronously within the import saga. Fast; scales linearly with batch size via MongoDB index lookups.
- DedupStage2Consumer: Handles fuzzy matching asynchronously after all Stage 1 batches complete. Decoupled from the import saga; runs once per import event.
Both consumers implement IDeduplicationService (Section 5.3) and are independently unit-testable.
Why native C# (not R subprocess): The ASySD algorithm is implemented directly in C# for the following reasons:
- No runtime dependency: No R installation required in the Docker image.
- Full observability: .NET distributed tracing, structured logging, and exception handling flow through the dedup logic without crossing a process boundary.
- Parallelism: PLINQ handles the embarrassingly parallel scoring step (each candidate pair is independent) natively with .NET thread pools.
- Algorithm ownership: CAMARADES (the ASySD maintainer) and SyRF share the same organisation. There is no divergence risk, and improvements discovered during SyRF implementation can flow back to the R package.
- Straightforward port: The algorithm is standard Jaro-Winkler + four blocking rounds + 25 boolean classification rules + Union-Find grouping — each component is well-understood and unit-testable.
The IDeduplicationService abstraction (Section 5.3) isolates the implementation. Future algorithmic improvements require only changing the implementation, not any consumers or saga logic.
3.2 Sequence Diagram
Stage 1: Synchronous (import saga waits; completes before returning to user)

    ImportSaga                   DedupStage1Consumer
    |                            |
    | Stage1DedupRequest         |
    | (BatchId, ProjectId,       |
    |  RecordCount)              |
    |--------------------------->|
    |                            | Load batch from pmDedupBatch
    |                            |
    |                            | DOI/PMID exact-match against pmPublication
    |                            |   → Create/update Publications (matched records)
    |                            |   → Create Studies: lifecycleStatus = Active (matched)
    |                            |   → Create Studies: lifecycleStatus = PendingDedupCheck (unmatched)
    |                            |   → Publish Stage2DedupRequest (if any unmatched)
    |                            |   → Delete pmDedupBatch document
    |                            |
    | Stage1Complete             |
    |<---------------------------|
    (import returns; citations visible to user immediately)

Stage 2: Asynchronous (background; user does not wait)

                                 DedupStage2Consumer
                                 |
    Stage2DedupRequest           |
    (ProjectId,                  |
     SystematicSearchId)         |
    ---------------------------->|
                                 | Fetch PendingDedupCheck studies (new batch)
                                 |   + Active studies without publicationId
                                 |   (via StudyRepository)
                                 |
                                 | CitationNormalizer
                                 |   (uppercase, strip punctuation, DOI normalisation)
                                 |
                                 | BlockingEngine
                                 |   (4 rounds → candidate pairs)
                                 |
                                 | JaroWinklerScorer
                                 |   (score 10 fields per pair, PLINQ parallel)
                                 |
                                 | MatchClassifier
                                 |   (25 boolean rules → TrueMatch / ProbableDuplicate)
                                 |
                                 | DuplicateGrouper
                                 |   (Union-Find transitive closure)
                                 |
                                 | Create/update Publications (PublicationRepository)
                                 |
                                 | Link Citations, set lifecycleStatus
                                 |   Active | Duplicate | PendingDuplicateReview
                                 |   (StudyRepository)
                                 |
                                 | Record all decisions (DedupAuditLog)
3.3 C# Algorithm Components
The DedupStage2Consumer orchestrates the following components, each independently unit-testable:
CitationNormalizer — equivalent to ASySD's format_citations():
- Converts all fields to uppercase
- Removes punctuation from title, abstract, year
- Normalises DOI: strips https://doi.org/ prefix, lowercases
- Normalises page ranges: -- → -
- Treats null/empty author as "UNKNOWN"
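The normalisation rules above can be sketched as a handful of pure functions. This is an illustrative sketch, not the ASySD port itself: the class and method names are hypothetical, and the use of char.IsPunctuation as the definition of "punctuation" is an assumption.

```csharp
#nullable enable
using System;
using System.Linq;

// Hypothetical helper illustrating the CitationNormalizer rules listed above.
public static class CitationNormalizerSketch
{
    // Title, abstract, year: drop punctuation, then uppercase.
    // Assumption: "punctuation" means Unicode punctuation as reported by char.IsPunctuation.
    public static string NormalizeText(string? value)
    {
        if (string.IsNullOrWhiteSpace(value)) return string.Empty;
        var kept = value.Where(c => !char.IsPunctuation(c)).ToArray();
        return new string(kept).Trim().ToUpperInvariant();
    }

    // Null or empty authors become the literal "UNKNOWN".
    public static string NormalizeAuthor(string? value) =>
        string.IsNullOrWhiteSpace(value) ? "UNKNOWN" : value.Trim().ToUpperInvariant();

    // DOI: lowercase, then strip the https://doi.org/ resolver prefix.
    public static string NormalizeDoi(string? value)
    {
        if (string.IsNullOrWhiteSpace(value)) return string.Empty;
        var doi = value.Trim().ToLowerInvariant();
        const string prefix = "https://doi.org/";
        return doi.StartsWith(prefix, StringComparison.Ordinal) ? doi.Substring(prefix.Length) : doi;
    }

    // Page ranges: collapse "--" to "-".
    public static string NormalizePages(string? value) =>
        string.IsNullOrWhiteSpace(value)
            ? string.Empty
            : value.Trim().Replace("--", "-").ToUpperInvariant();
}
```

Keeping each rule in its own function makes it straightforward to unit-test the C# output against the R package's output field by field.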
BlockingEngine — equivalent to ASySD's compare.dedup() (4 rounds):
- Groups records by blocking key combinations (see Section 2.2)
- Produces candidate pairs (record_id pairs that share at least one blocking key)
- Deduplicates pairs across rounds
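A minimal sketch of the blocking step follows. It covers only two of the Round 1 criteria from Section 2.2 (Title AND Pages, and DOI) to keep the example short. The NormalizedCitation record and BlockingSketch class are hypothetical, and skipping records whose blocking field is empty is an assumption of this sketch rather than documented ASySD behaviour.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, already-normalised view of a citation (empty string = missing field).
public sealed record NormalizedCitation(int Index, string Title, string Author, string Pages, string Doi);

public static class BlockingSketch
{
    // Two Round 1 criteria; a null key means "field missing, do not block on it" (assumption).
    private static readonly Func<NormalizedCitation, string?>[] Round1Keys =
    {
        c => (c.Title == "" || c.Pages == "") ? null : "TP|" + c.Title + "|" + c.Pages,
        c => c.Doi == "" ? null : "DOI|" + c.Doi
    };

    public static HashSet<(int, int)> CandidatePairs(IReadOnlyList<NormalizedCitation> records)
    {
        // Set semantics deduplicate pairs that are produced by more than one criterion.
        var pairs = new HashSet<(int, int)>();
        foreach (var keyOf in Round1Keys)
        {
            var blocks = records
                .Select(r => new { Key = keyOf(r), r.Index })
                .Where(x => x.Key != null)
                .GroupBy(x => x.Key!, x => x.Index);

            foreach (var block in blocks)
            {
                var members = block.OrderBy(i => i).ToList();
                for (var a = 0; a < members.Count; a++)
                    for (var b = a + 1; b < members.Count; b++)
                        pairs.Add((members[a], members[b])); // always (lower, higher)
            }
        }
        return pairs;
    }
}
```

Extending the sketch to all four rounds would mean adding the remaining key selectors; the pair-generation loop stays the same.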
JaroWinklerScorer — equivalent to RecordLinkage::jarowinkler():
- Standard Jaro-Winkler: matching window floor(max_len / 2) - 1, Winkler prefix bonus capped at 4 characters with scaling factor 0.1
- Scores 10 fields per candidate pair: author, title, abstract, year, pages, number, volume, journal, isbn, doi
- Uses Parallel.ForEach (or PLINQ) over candidate pairs for throughput
- Missing-field handling: both-null pairs score 0 for doi/abstract/year/journal/isbn; 1 for pages/volume/number (matching ASySD's NA handling)
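For reference, the Jaro-Winkler parameters named above (matching window floor(max_len / 2) - 1, prefix capped at 4 characters, scaling factor 0.1) can be sketched as follows. The class name is hypothetical. Returning 0 when either input is empty is an assumption of this sketch; the field-specific missing-value scores described above would be applied by the caller before this function is reached.

```csharp
using System;

public static class JaroWinklerSketch
{
    public static double Similarity(string s1, string s2)
    {
        // Assumption: empty input scores 0 here; per-field NA handling lives in the caller.
        if (s1.Length == 0 || s2.Length == 0) return 0.0;

        var window = Math.Max(0, Math.Max(s1.Length, s2.Length) / 2 - 1);
        var matched1 = new bool[s1.Length];
        var matched2 = new bool[s2.Length];
        var matches = 0;

        // Pass 1: find characters that match within the sliding window.
        for (var i = 0; i < s1.Length; i++)
        {
            var lo = Math.Max(0, i - window);
            var hi = Math.Min(s2.Length - 1, i + window);
            for (var j = lo; j <= hi; j++)
            {
                if (matched2[j] || s1[i] != s2[j]) continue;
                matched1[i] = true;
                matched2[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Pass 2: count matched characters that appear in a different order.
        var k = 0;
        var outOfOrder = 0;
        for (var i = 0; i < s1.Length; i++)
        {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1[i] != s2[k]) outOfOrder++;
            k++;
        }

        double m = matches;
        var transpositions = outOfOrder / 2.0;
        var jaro = (m / s1.Length + m / s2.Length + (m - transpositions) / m) / 3.0;

        // Winkler bonus: common prefix up to 4 characters, scaling factor 0.1.
        var prefix = 0;
        var maxPrefix = Math.Min(4, Math.Min(s1.Length, s2.Length));
        while (prefix < maxPrefix && s1[prefix] == s2[prefix]) prefix++;

        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

The textbook pair MARTHA / MARHTA scores roughly 0.961 with these parameters, which makes a convenient regression check when comparing against RecordLinkage::jarowinkler().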
MatchClassifier — equivalent to ASySD's identify_true_matches():
- Applies 25 OR-combined boolean threshold rules to produce TrueMatch or ProbableDuplicate
- Applies DOI mismatch filter (low-DOI pairs with mismatching DOIs are demoted to ProbableDuplicate)
- Applies year mismatch filter (year difference > 1 demotes to ProbableDuplicate)
- See Section 2.2 for the full rule set
DuplicateGrouper — equivalent to ASySD's generate_dup_id():
- Union-Find (disjoint set) over confirmed pairs to compute transitive closure
- Selects canonical record per group using retention priority: keep_source preference > keep_label preference > lowest record index
- Returns DuplicateGroup list (see Section 5.2)
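The transitive-closure step can be sketched with a small Union-Find. The type names are hypothetical and records are identified by integer index for brevity. The sketch implements only the final tie-breaker of the retention priority (lowest record index), by always keeping the smaller root as group representative; the keep_source and keep_label preferences are not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class UnionFind
{
    private readonly int[] _parent;

    public UnionFind(int size)
    {
        _parent = Enumerable.Range(0, size).ToArray();
    }

    public int Find(int x)
    {
        while (_parent[x] != x)
        {
            _parent[x] = _parent[_parent[x]]; // path halving
            x = _parent[x];
        }
        return x;
    }

    // The smaller root always wins, so each group's root is its lowest record index.
    public void Union(int a, int b)
    {
        int ra = Find(a), rb = Find(b);
        if (ra == rb) return;
        if (ra < rb) _parent[rb] = ra; else _parent[ra] = rb;
    }
}

public static class DuplicateGrouperSketch
{
    // Returns groups of two or more records; the first member of each group is the canonical index.
    public static List<List<int>> Group(int recordCount, IEnumerable<(int A, int B)> confirmedPairs)
    {
        var uf = new UnionFind(recordCount);
        foreach (var pair in confirmedPairs) uf.Union(pair.A, pair.B);

        return Enumerable.Range(0, recordCount)
            .GroupBy(i => uf.Find(i))
            .Where(g => g.Count() > 1)
            .Select(g => g.OrderBy(i => i).ToList())
            .ToList();
    }
}
```

For example, the pairs (0,1) and (1,2) over five records yield a single group [0, 1, 2], with records 3 and 4 left as unique.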
3.4 Parallelism and Scaling
The scoring step (Jaro-Winkler across 10 fields per candidate pair) is embarrassingly parallel — each pair is independent. PLINQ handles this within the JaroWinklerScorer with no additional infrastructure.
| Concern | Approach |
|---|---|
| Pair scoring throughput | Parallel.ForEach / PLINQ on candidate pairs |
| Large batches | 1,000-record batch limit for Stage 1; Stage 2 processes all PendingDedupCheck studies at once (see Section 3.5) |
| Concurrent batches | DedupStage1Consumer is stateless per batch; Stage 2 runs once after all Stage 1 batches complete, eliminating inter-batch race conditions |
| Memory | Blocking keys are computed in-memory; 1,000 records × 10 fields is well within heap limits |
The service interface contract (Section 5) is abstracted from the implementation. Future optimisations (e.g., pre-computed blocking indexes, approximate nearest-neighbour indexes for the fuzzy step) require only changing the implementation behind IDeduplicationService.
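As a rough illustration of the pair-scoring row in the table above, PLINQ can fan the candidate pairs out across the thread pool. The class name and the scorePair delegate are hypothetical placeholders for the real per-pair scoring of 10 fields.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelScoringSketch
{
    // scorePair is a hypothetical stand-in for scoring one candidate pair.
    // It must be thread-safe because PLINQ invokes it concurrently.
    public static List<(int A, int B, double Score)> ScoreAll(
        IEnumerable<(int A, int B)> candidatePairs,
        Func<int, int, double> scorePair)
    {
        return candidatePairs
            .AsParallel()
            .Select(p => (p.A, p.B, Score: scorePair(p.A, p.B)))
            .ToList();
    }
}
```

Result order is not guaranteed without AsOrdered(); since each scored pair carries its own record indexes, downstream classification should not need ordering.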
3.5 Large Batch Handling
Systematic searches can import tens of thousands of citations. The two-stage pipeline (Section 3.1) addresses the core scalability concern: Stage 1 (DOI/PMID matching) is fast and synchronous; Stage 2 (fuzzy matching) is asynchronous and decoupled from the import saga. This section specifies how large imports are handled within that architecture.
Batch size limit. Each Stage1DedupRequest SHALL cover at most 1,000 records (MAX_DEDUP_BATCH_SIZE). The import saga MUST split larger imports into multiple batches, each producing a separate message. Multiple Stage 1 batches from the same systematic search MAY run concurrently — each is stateless with respect to the others.
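A minimal sketch of the splitting rule, assuming the saga holds the imported citation IDs in memory; the DedupBatching class is hypothetical, and only the 1,000-record limit comes from this specification.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DedupBatching
{
    // Mirrors MAX_DEDUP_BATCH_SIZE from this section.
    public const int MaxDedupBatchSize = 1000;

    // Splits an import into consecutive batches of at most MaxDedupBatchSize records.
    public static IReadOnlyList<IReadOnlyList<Guid>> Split(IReadOnlyList<Guid> citationIds)
    {
        var batches = new List<IReadOnlyList<Guid>>();
        for (var offset = 0; offset < citationIds.Count; offset += MaxDedupBatchSize)
        {
            batches.Add(citationIds.Skip(offset).Take(MaxDedupBatchSize).ToList());
        }
        return batches;
    }
}
```

The number of batches returned would populate TotalBatchCount, and each batch's position in the list its BatchIndex, on the corresponding Stage1DedupRequest.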
Staging collection. Rather than embedding records in the message, the import saga SHALL write each batch to a pmDedupBatch staging document in MongoDB before sending the Stage1DedupRequest. The message carries only a BatchId reference. The DedupStage1Consumer loads the batch, processes it, and deletes the staging document on completion. A TTL index on pmDedupBatch.expiresAt (7-day expiry) ensures orphaned documents are cleaned up if the consumer crashes.
Stage 2 triggering. After all Stage 1 batches complete for a systematic search, the import saga publishes a single Stage2DedupRequest. Stage 2 fetches all PendingDedupCheck studies from the project and runs the full fuzzy matching pipeline once over them. Because Stage 2 sees all new records together, there is no inter-batch race condition — two records that would be duplicates of each other are in the same Stage 2 input set.
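One way the saga could detect that all Stage 1 batches have completed is to track completed batch indexes. This in-memory tracker is a hypothetical illustration only; a real saga would keep this state in its persisted saga instance, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

public sealed class Stage1CompletionTracker
{
    private readonly HashSet<int> _completed = new HashSet<int>();
    private readonly int _totalBatchCount;

    public Stage1CompletionTracker(int totalBatchCount)
    {
        if (totalBatchCount <= 0) throw new ArgumentOutOfRangeException(nameof(totalBatchCount));
        _totalBatchCount = totalBatchCount;
    }

    // Returns true only for the call that completes the last outstanding batch,
    // so a redelivered completion message cannot trigger Stage 2 twice.
    public bool MarkCompleted(int batchIndex)
    {
        if (batchIndex < 0 || batchIndex >= _totalBatchCount)
            throw new ArgumentOutOfRangeException(nameof(batchIndex));

        var firstTime = _completed.Add(batchIndex);
        return firstTime && _completed.Count == _totalBatchCount;
    }
}
```

The saga would publish the single Stage2DedupRequest when MarkCompleted returns true.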
Stage 2 incremental dedup. The DedupStage2Consumer includes in its input:
- All PendingDedupCheck studies from the new import.
- Existing Active studies that lack a publicationId (no authoritative DOI/PMID). Studies that already have a publicationId do not need to be included: if the new record were a duplicate of one, Stage 1 would have caught it via the shared Publication's DOI/PMID.
This means Stage 2 operates on a much smaller existing-study set than a naive "all Active studies" approach: only studies that previously had no identifier are candidates for fuzzy-match duplication with the new batch.
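The input selection can be expressed as a single predicate. The StudyRef record, the enum, and the class name below are hypothetical; only the two inclusion conditions come from the list above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Status names are taken from this specification; the enum itself is a hypothetical stand-in.
public enum LifecycleStatus { Active, PendingDedupCheck, PendingDuplicateReview, Duplicate, Merged }

// Hypothetical projection of a Study.
public sealed record StudyRef(Guid Id, LifecycleStatus Status, Guid? PublicationId);

public static class Stage2InputSelector
{
    // Includes new PendingDedupCheck studies plus Active studies that have no Publication link.
    public static IEnumerable<StudyRef> Select(IEnumerable<StudyRef> projectStudies) =>
        projectStudies.Where(s =>
            s.Status == LifecycleStatus.PendingDedupCheck ||
            (s.Status == LifecycleStatus.Active && s.PublicationId == null));
}
```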
Stage 2 frequency. For MVP, Stage 2 is triggered immediately after all Stage 1 batches for a systematic search complete. A future optimisation MAY coalesce multiple concurrent imports into a single Stage 2 run if several systematic searches complete within a short time window.
4. Confidence Model
4.1 Two-Tier Classification
The MatchClassifier component (Section 3.3) classifies candidate pairs into two tiers:
- AutoConfirmed: High-confidence duplicates identified by ASySD's internal heuristic filters. These are pairs where the match evidence is strong enough for automatic resolution. For studies with no review data, these are auto-merged. For studies with review data, these are flagged for admin confirmation (see Section 7, Bibliographic Consolidation Rules).
- ProbableDuplicate: Possible matches that ASySD flags for manual review. The match evidence is suggestive but insufficient for automatic resolution. These are ALWAYS queued for admin review regardless of review data status.
4.2 Threshold Configuration
The MatchClassifier's 25 boolean threshold rules are NOT exposed as user-configurable parameters. The two-tier output (AutoConfirmed vs. ProbableDuplicate) provides sufficient admin oversight for MVP.
Future iteration: If project administrators request sensitivity tuning, a configurable parameter MAY be added that shifts pairs between AutoConfirmed and ProbableDuplicate tiers by adjusting the threshold rules in MatchClassifier, without changing the blocking or scoring logic.
5. Service Interface Contract
Phase 12 MUST implement the following types. These types define the boundary between the dedup service and the import saga / admin review queue.
5.1 Input Types
Stage1DedupRequest {
    ProjectId: Guid
    SystematicSearchId: Guid
    BatchId: Guid            // References a pmDedupBatch staging document
    RecordCount: int         // Count of records in batch (for progress tracking only)
    TotalBatchCount: int     // Total number of Stage 1 batches for this search
    BatchIndex: int          // 0-based index of this batch (saga uses to detect completion)
}
Stage2DedupRequest {
    ProjectId: Guid
    SystematicSearchId: Guid
    IsRetroactive: bool      // true for admin-triggered retroactive dedup
}
DedupBatch {                 // Stored in pmDedupBatch staging collection
    _id: Guid                // = BatchId
    ProjectId: Guid
    SystematicSearchId: Guid
    Records: List<CitationForDedup>
    CreatedAt: DateTime
    ExpiresAt: DateTime      // TTL index (7 days) for orphan cleanup
}
CitationForDedup {
    CitationId: Guid
    Title: string
    Authors: string
    Year: string
    Journal: string
    Doi: string
    Pages: string
    Volume: string
    Issue: string
    Abstract: string
    Isbn: string
    SourceName: string
    SourceType: string
}
5.2 Output Types
DeduplicationResult {
    AutoConfirmedGroups: List<DuplicateGroup>
    ProbableGroups: List<DuplicateGroup>
    UniqueRecordIds: List<Guid>
    ProcessingTimeMs: long
    AlgorithmVersion: string          // SyRF dedup algorithm version (e.g., "1.0.0"); bumped when matching rules change
}
DuplicateGroup {
    GroupId: Guid
    CanonicalRecordId: Guid           // Best record chosen by retention rules
    MemberRecordIds: List<Guid>       // All records in the group (including canonical)
    Confidence: DuplicateConfidence   // AutoConfirmed | ProbableDuplicate
    MatchDetails: MatchSummary
}
MatchSummary {
    MatchingFields: List<FieldMatch>  // Which fields matched and similarity scores
    BlockingRound: int                // Which round formed the candidate pair (1-4)
}
FieldMatch {
    FieldName: string                 // e.g., "title", "author", "doi"
    SimilarityScore: double           // 0.0-1.0 Jaro-Winkler similarity
}
enum DuplicateConfidence {
    AutoConfirmed = 0,
    ProbableDuplicate = 1
}
5.3 Service Abstraction
interface IDeduplicationService {
    Task<DeduplicationResult> DeduplicateAsync(
        DeduplicationRequest request,
        CancellationToken cancellationToken
    );
}
This interface abstracts the deduplication implementation. The C# implementation is the only supported mechanism. Future algorithmic improvements (e.g., updated blocking rules, alternative similarity metrics) require only changing the implementation behind this interface, not any consumers or saga logic.
6. Canonical Enrichment Rules
When multiple Citations are confirmed as duplicates and linked to the same Publication, the canonical metadata fields on the Publication SHALL be populated using best-of-breed selection. These rules align with the merge_citations = TRUE behavior documented in Hair et al. (2023).
6.1 Field-by-Field Selection Rules
| Field | Selection Rule | Rationale |
|---|---|---|
| canonicalTitle | Prefer longest non-empty title | Longer titles may include subtitles; more complete information |
| canonicalAuthors | Prefer longest author list (by author count) | More complete authorship record |
| canonicalAbstract | Prefer longest non-empty abstract | More complete information for screening |
| canonicalYear | Prefer explicit 4-digit year | Avoid parsed or estimated years from inconsistent sources |
| canonicalJournal | Prefer full journal name over abbreviation (longer string) | Full names are more readable and less ambiguous |
| canonicalPages | Prefer complete page range (contains "-") | Full page span is more informative than a single page number |
| canonicalVolume | Prefer non-empty value | Any volume data is better than none; no further ranking |
| canonicalIssue | Prefer non-empty value | Any issue data is better than none; no further ranking |
| canonicalIsbn | Prefer non-empty value | Any ISBN/ISSN is better than none; no further ranking |
| doi | Prefer non-empty DOI; if multiple non-empty, prefer PubMed-sourced DOI | PubMed DOIs are the most reliably formatted and verified |
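Three of the selection rules above can be sketched as small helpers. The class and method names are hypothetical. Falling back to the first non-empty value when no candidate satisfies the preference, and resolving ties in favour of the earliest candidate, are assumptions of this sketch; the table only states the preference.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CanonicalFieldSelector
{
    // canonicalTitle / canonicalAbstract / canonicalJournal: longest non-empty value wins.
    // OrderByDescending is stable, so ties keep the earliest candidate (assumption).
    public static string? Longest(IEnumerable<string?> values) =>
        values.Where(v => !string.IsNullOrWhiteSpace(v))
              .OrderByDescending(v => v!.Trim().Length)
              .FirstOrDefault();

    // canonicalYear: prefer an explicit 4-digit year; otherwise any non-empty value (assumption).
    public static string? Year(IEnumerable<string?> values)
    {
        var nonEmpty = NonEmpty(values);
        return nonEmpty.FirstOrDefault(v => Regex.IsMatch(v, @"^\d{4}$")) ?? nonEmpty.FirstOrDefault();
    }

    // canonicalPages: prefer a complete range (contains "-"); otherwise any non-empty value (assumption).
    public static string? Pages(IEnumerable<string?> values)
    {
        var nonEmpty = NonEmpty(values);
        return nonEmpty.FirstOrDefault(v => v.Contains("-")) ?? nonEmpty.FirstOrDefault();
    }

    private static List<string> NonEmpty(IEnumerable<string?> values) =>
        values.Where(v => !string.IsNullOrWhiteSpace(v)).Select(v => v!.Trim()).ToList();
}
```

In the real service each selection would also need to return the winning Citation's ID so that the provenance entry described in Section 6.2 can be written.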
6.2 Provenance Tracking
Each canonical field on the Publication MUST record which Citation (and therefore which search/source) provided it. This enables audit trail and cross-project attribution.
MetadataProvenance {
    FieldName: string          // Name of the canonical field (e.g., "canonicalTitle")
    SourceCitationId: Guid     // Citation that provided this field value
    SourceProjectId: Guid      // Project that owns the source Citation
    SelectedAt: DateTime       // When this field was selected/updated
}
The Publication.metadataProvenance[] array (see three-level-data-model.md Section 2) MUST contain one entry per canonical field that has been populated.
6.3 Cross-Project Enrichment
When a new Citation from Project B links to a Publication already known from Project A, the canonical fields on the Publication MAY be updated if Project B's Citation has better metadata per the selection rules in Section 6.1.
Rules:
1. The enrichment update is automatic (no admin approval required).
2. The MetadataProvenance entry for the updated field MUST be updated to reflect the new source.
3. The previous provenance is overwritten (not preserved as history). The audit log in the DedupAuditLog captures the change event.
4. Project A's view of the Publication is immediately enriched by Project B's data (and vice versa). This is an intentional design choice: Publications accumulate the best metadata from all projects.
7. Bibliographic Consolidation Rules for Already-Reviewed Duplicates
These five scenarios define the exact behavioral rules for handling duplicates based on their review state. The scenarios are mutually exclusive and exhaustive for all duplicate pair cases.
7.1 Scenario Table
| # | Scenario | Automation Level | Action | Study.lifecycleStatus |
|---|---|---|---|---|
| 1 | High-confidence, neither study reviewed | Auto-confirm | Merge into single Study: move all Citations to the canonical Study and mark the secondary Study as Duplicate (the secondary Study is not deleted) | Canonical: Active; secondary: Duplicate |
| 2 | High-confidence, one study reviewed | Auto-confirm | Keep reviewed Study as canonical, link Citations from unreviewed Study to canonical | Canonical: unchanged; unreviewed: Duplicate |
| 3 | High-confidence, both studies reviewed (same stage) | Admin review required | If admin confirms: secondary's annotation/screening sessions become additional candidate sessions for reconciliation on canonical Study | Canonical: unchanged; secondary: Merged |
| 4 | High-confidence, both studies reviewed (different stages) | Admin review required | If admin confirms: link both Studies via shared Publication but do NOT merge review data (contextually different stage data) | Both: unchanged; linked via Publication (same publicationId) |
| 5 | Probable duplicate, any review state | Admin review always | Queue for admin decision; admin can confirm (triggers applicable rule from scenarios 1-4) or reject (mark as not duplicate) | PendingDuplicateReview until resolved |
7.2 Scenario Details
Scenario 1: High-confidence, neither reviewed
Both studies have no screening decisions and no annotation sessions. The dedup service auto-confirms the merge:
- Select the canonical Study using the DuplicateGrouper retention priority (prefer abstract-bearing record; among ties, prefer more recently imported).
- Move all Citations from the secondary Study to the canonical Study's citations[] array.
- Set secondary Study's lifecycleStatus = Duplicate.
- Set secondary Study's duplicateGroupId to the group identifier.
- Update the Publication's canonical metadata using enrichment rules (Section 6.1).
- Log to DedupAuditLog with Decision = Confirmed, DecidedBy = System.
Scenario 2: High-confidence, one reviewed
One study has review data; the other does not. The reviewed study MUST be the canonical:
- The reviewed Study becomes canonical (regardless of ASySD retention priority).
- Move all Citations from the unreviewed Study to the canonical Study's citations[] array.
- Set unreviewed Study's lifecycleStatus = Duplicate.
- Update the Publication's canonical metadata using enrichment rules (Section 6.1).
- Log to DedupAuditLog with Decision = Confirmed, DecidedBy = System.
Scenario 3: High-confidence, both reviewed (same stage)
Both studies have review data in the same stage. Admin MUST decide:
- Queue for admin review (see Section 8, Admin Review Queue).
- If admin confirms: the canonical Study is the one with more review data (or admin choice).
- The secondary Study's annotation sessions and screening decisions become additional candidate sessions for reconciliation on the canonical Study.
- Set secondary Study's lifecycleStatus = Merged.
- The secondary Study retains all its data but is excluded from stage pools.
- Log to DedupAuditLog with Decision = Confirmed, DecidedBy = AdminId.
Key detail for Scenario 3: "Additional candidate sessions for reconciliation" means the secondary study's annotation sessions are treated as if additional annotators had annotated the canonical study. This increases the number of candidate sessions available for reconciliation, potentially improving agreement assessment.
Scenario 4: High-confidence, both reviewed (different stages)
Both studies have review data but in different stages. The review data is contextually different and MUST NOT be merged:
- Queue for admin review (see Section 8, Admin Review Queue).
- If admin confirms: link both Studies via the same Publication (publicationId points to the same Publication).
- Do NOT merge review data, move annotation sessions, or change review state.
- Both Studies retain their current lifecycleStatus (no change).
- The duplicate is resolved at the bibliographic level (shared Publication) but not at the review level.
- Log to DedupAuditLog with Decision = Confirmed, DecidedBy = AdminId.
Scenario 5: Probable duplicate, any state
ASySD classified the pair as probable (not high-confidence). Regardless of review state:
- Set both Studies' lifecycleStatus = PendingDuplicateReview.
- Queue for admin review (see Section 8, Admin Review Queue).
- Admin decides:
  - Confirm Duplicate: Apply the appropriate rule from scenarios 1-4 based on review state.
  - Not Duplicate: Dismiss the pair; restore both Studies' lifecycleStatus to their previous values; log rejection.
  - Defer: Keep in queue for later review.
- Log to DedupAuditLog with the admin's decision.
7.3 Key Invariant
NEVER auto-merge studies that have review data. Admin ALWAYS decides for reviewed studies. This is a system invariant that MUST be enforced at the service level, not just at the UI level.
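The scenario table in Section 7.1 reduces to a small decision function, sketched below. The Resolution enum and ConsolidationRules class are hypothetical, and the sameStage flag is assumed to be computed by the caller from each study's review data.

```csharp
using System;

public enum DuplicateConfidence { AutoConfirmed = 0, ProbableDuplicate = 1 }

// Hypothetical outcome type; one value per distinct handling path in Section 7.1.
public enum Resolution
{
    AutoMerge,           // Scenarios 1 and 2
    AdminReviewMerge,    // Scenario 3 (both reviewed, same stage)
    AdminReviewLink,     // Scenario 4 (both reviewed, different stages)
    AdminReviewProbable  // Scenario 5
}

public static class ConsolidationRules
{
    public static Resolution Resolve(
        DuplicateConfidence confidence, bool aReviewed, bool bReviewed, bool sameStage)
    {
        // Scenario 5: probable duplicates always go to the admin, whatever the review state.
        if (confidence == DuplicateConfidence.ProbableDuplicate) return Resolution.AdminReviewProbable;

        // Scenarios 1 and 2: at most one study has review data, so no review data is merged.
        if (!(aReviewed && bReviewed)) return Resolution.AutoMerge;

        // Scenarios 3 and 4: both reviewed, admin decides.
        return sameStage ? Resolution.AdminReviewMerge : Resolution.AdminReviewLink;
    }
}
```

Placing this decision in the service layer, rather than in the UI, is what allows the invariant to hold for retroactive and import-time runs alike.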
8. Admin Review Queue
8.1 Queue Population
The admin review queue is populated from two sources:
- Probable duplicates (Scenario 5): All probable duplicate pairs from ASySD output.
- High-confidence duplicates with review data (Scenarios 3, 4): Auto-confirmed duplicates where both studies have review data.
8.2 Queue Item Display
Each queue item MUST show:
| Element | Description |
|---|---|
| Candidate pair | Both studies with their bibliographic metadata side-by-side |
| Match details | Fields that matched, Jaro-Winkler similarity scores per field, blocking round |
| Review data summary | Which stages each study has been reviewed in, number of screening decisions, number of annotation sessions |
| Confidence tier | AutoConfirmed or ProbableDuplicate |
| Import source | Which systematic search imported each study |
8.3 Admin Actions
| Action | Effect | Applicable To |
|---|---|---|
| Confirm Duplicate | Triggers merge per applicable scenario (1-4); removes pair from queue | All queue items |
| Not Duplicate | Dismisses the pair; restores lifecycleStatus to previous value; pair not re-queued | All queue items |
| Defer | Keeps pair in queue for later review; no status change | All queue items |
All admin actions MUST be logged in the DedupAuditLog with the admin's investigator ID and timestamp.
9. Merge Wizard for Reviewed Duplicates
When an admin confirms a duplicate pair where BOTH studies have review data (Scenarios 3 and 4), a merge wizard SHALL guide the resolution.
9.1 Same-Stage Merge (Scenario 3)
The merge wizard presents:
- Both studies' metadata side-by-side (title, authors, year, journal).
- Both studies' annotation sessions for the shared stage -- showing session count, annotator names (anonymized if blinding is active), and completion status.
- Merge preview: Shows the canonical study after merge, with the secondary study's sessions listed as "additional candidate sessions for reconciliation."
- Confirmation button: Admin confirms; secondary study's sessions become additional candidates; secondary study set to Merged.
9.2 Cross-Stage Link (Scenario 4)
The merge wizard presents:
- Both studies' metadata side-by-side.
- Stage breakdown: Shows which stages each study has review data in, emphasizing that the stages are different.
- Link preview: Shows that both studies will share the same Publication but retain separate review data.
- Confirmation button: Admin confirms; studies linked via Publication; no review data moved.
9.3 Post-Merge State
In both cases after admin confirmation:
- The secondary study's lifecycleStatus is set to Merged (Scenario 3) or remains unchanged (Scenario 4).
- The secondary study retains ALL its data (Citations, annotation sessions, screening decisions) but is excluded from stage study pools.
- The canonical study gains the secondary study's Citations (Scenario 3) or is linked via Publication (Scenario 4).
10. Dedup Audit Log
Every deduplication decision (automatic or admin-reviewed) SHALL create an audit record. The audit log provides traceability and enables reversal of incorrect dedup decisions.
10.1 Audit Entry Structure
DedupAuditEntry {
    EntryId: Guid
    ProjectId: Guid
    RecordIdA: Guid                   // First study/Citation in the pair
    RecordIdB: Guid                   // Second study/Citation in the pair
    Confidence: DuplicateConfidence   // AutoConfirmed | ProbableDuplicate
    Decision: DedupDecision           // Confirmed | Rejected | Pending
    DecidedBy: DedupDecider           // System (for auto-confirmed) | AdminId (Guid)
    DecidedAt: DateTime               // When the decision was made
    MatchDetails: MatchSummary        // Fields matched, similarity scores, blocking round
    AlgorithmVersion: string          // SyRF dedup algorithm version (e.g., "1.0.0"); bumped when matching rules change
    Notes: string?                    // Optional admin notes (e.g., reason for rejection)
}
enum DedupDecision {
    Confirmed = 0,   // Duplicate confirmed (auto or admin)
    Rejected = 1,    // Not a duplicate (admin decision)
    Pending = 2      // Awaiting admin review
}
// DedupDecider is a discriminated union:
// - System: automatic decision by the dedup service
// - Admin(InvestigatorId: Guid): manual decision by a project admin
10.2 Audit Log Storage
The DedupAuditLog SHOULD be stored either as an embedded array on the project document or in a separate audit collection, depending on expected volume. For MVP, embedding on the project document is recommended if the expected number of dedup decisions per project is below 10,000. For larger projects, a separate pmDedupAuditLog collection MAY be created.
Decision deferred to: Phase 12 implementation (storage location).
10.3 Retention
Audit entries SHALL NEVER be deleted. They provide the audit trail required for reproducibility of systematic review methodology.
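One way to honour the no-deletion rule is an append-only log whose entries are immutable. The Python sketch below is illustrative only (the service is specified in C#): `AuditEntry` is a trimmed-down, hypothetical version of DedupAuditEntry, and recording a later decision as a new entry rather than editing the old one is an assumption, not something this specification mandates.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)  # frozen: entries cannot be modified after creation
class AuditEntry:
    entry_id: uuid.UUID
    record_id_a: uuid.UUID
    record_id_b: uuid.UUID
    decision: str            # "Confirmed" | "Rejected" | "Pending"
    decided_by: str          # "System" or an admin identifier
    decided_at: datetime
    notes: Optional[str] = None


class AppendOnlyAuditLog:
    """Hypothetical in-memory log exposing append and read, but no delete."""

    def __init__(self) -> None:
        self._entries: List[AuditEntry] = []

    def record(self, a, b, decision, decided_by, notes=None) -> AuditEntry:
        entry = AuditEntry(uuid.uuid4(), a, b, decision, decided_by,
                           datetime.now(timezone.utc), notes)
        self._entries.append(entry)
        return entry

    def history(self, a, b) -> Tuple[AuditEntry, ...]:
        # Pairs are unordered: (a, b) and (b, a) are the same pair.
        pair = {a, b}
        return tuple(e for e in self._entries
                     if {e.record_id_a, e.record_id_b} == pair)
```

Under this design, a reversed decision shows up as two entries for the same pair, so the full decision history stays reconstructable.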
11. PRISMA Box 3 Integration
The dedup service produces the data required for PRISMA box 3 ("Records removed before screening"). See prisma-flow-diagram-mapping.md Section 3.1, Box 3.
11.1 Derivation Rules
| PRISMA Field | Derivation | Data Source |
|---|---|---|
| duplicates | COUNT(citations across all studies in project) - COUNT(distinct active Studies after dedup) | Study.citations[] count vs. Study count WHERE lifecycleStatus NOT IN (Duplicate, Merged) |
| excluded_automatic | COUNT(studies WHERE lifecycleStatus = RemovedByAutomation) | Study.lifecycleStatus |
| excluded_other | COUNT(studies WHERE lifecycleStatus = RemovedOther) | Study.lifecycleStatus |
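The three derivations can be sketched in Python as below. This is an illustration under stated assumptions, not the service's C# code: the function name `prisma_box3` and the `(lifecycle_status, citation_count)` input shape are hypothetical, and "distinct active Studies after dedup" is read per the Data Source column as studies whose status is not Duplicate or Merged.

```python
from collections import Counter


def prisma_box3(studies):
    """studies: iterable of (lifecycle_status, citation_count) pairs."""
    studies = list(studies)
    total_citations = sum(count for _, count in studies)
    # Studies that survive dedup, per the Data Source column.
    non_duplicates = sum(
        1 for status, _ in studies if status not in ("Duplicate", "Merged")
    )
    by_status = Counter(status for status, _ in studies)
    return {
        "duplicates": total_citations - non_duplicates,
        "excluded_automatic": by_status["RemovedByAutomation"],
        "excluded_other": by_status["RemovedOther"],
    }


sample = [
    ("Active", 1),
    ("Active", 1),
    ("Duplicate", 1),
    ("RemovedByAutomation", 1),
    ("RemovedOther", 1),
]
box3 = prisma_box3(sample)
```

For this sample, five citations and four non-duplicate studies give `duplicates == 1`, with one record in each of the two exclusion fields.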
11.2 Count Consistency
The following equation MUST hold for any project at any point in time:
total_import_records = unique_active_studies + duplicates_removed + excluded_automatic + excluded_other + pending_review + pending_dedup
Where:
- total_import_records = SUM(COUNT(study.citations[])) for all studies in project
- unique_active_studies = COUNT(studies WHERE lifecycleStatus NOT IN (Duplicate, Merged, RemovedByAutomation, RemovedOther, PendingDuplicateReview, PendingDedupCheck))
- duplicates_removed = COUNT(studies WHERE lifecycleStatus IN (Duplicate, Merged))
- excluded_automatic = COUNT(studies WHERE lifecycleStatus = RemovedByAutomation)
- excluded_other = COUNT(studies WHERE lifecycleStatus = RemovedOther)
- pending_review = COUNT(studies WHERE lifecycleStatus = PendingDuplicateReview)
- pending_dedup = COUNT(studies WHERE lifecycleStatus = PendingDedupCheck)
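A consistency check along these lines could run in tests or as a periodic job. The Python sketch below is hypothetical (the names `BUCKETS`, `count_buckets`, and `is_consistent` are not part of this specification), and its sample data assumes each study carries exactly one import record.

```python
from collections import Counter

# Maps each non-active lifecycleStatus to its term in the equation.
BUCKETS = {
    "Duplicate": "duplicates_removed",
    "Merged": "duplicates_removed",
    "RemovedByAutomation": "excluded_automatic",
    "RemovedOther": "excluded_other",
    "PendingDuplicateReview": "pending_review",
    "PendingDedupCheck": "pending_dedup",
}


def count_buckets(studies):
    """studies: iterable of (lifecycle_status, citation_count) pairs."""
    counts = Counter()
    for status, _ in studies:
        # Any status not listed above counts as a unique active study.
        counts[BUCKETS.get(status, "unique_active_studies")] += 1
    return counts


def is_consistent(studies):
    studies = list(studies)
    total_import_records = sum(count for _, count in studies)
    return total_import_records == sum(count_buckets(studies).values())


sample = [
    ("Active", 1),
    ("Duplicate", 1),
    ("PendingDedupCheck", 1),
    ("RemovedOther", 1),
]
ok = is_consistent(sample)
```

Because the status buckets partition the studies, the right-hand side always equals the study count; the check therefore fails whenever the citation total diverges from it.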
12. Duplicate Study Exclusion from Stage Pools
System Invariant
Studies with lifecycleStatus IN (Duplicate, Merged) MUST NOT appear in any stage study pool. This invariant ensures that:
- Screeners never screen a study that has already been identified as a duplicate.
- Annotators never annotate a study that has been merged into another.
- PRISMA counts reflect only unique, active studies in the screening pipeline.
Enforcement
This invariant SHALL be enforced at the query/filter level, not by deletion:
- All stage study pool queries MUST include a filter:
  lifecycleStatus NOT IN (Duplicate, Merged, PendingDuplicateReview, PendingDedupCheck, RemovedByAutomation, RemovedOther)
- PendingDedupCheck excludes studies from pools until Stage 2 fuzzy matching resolves their duplicate status.
- PendingDuplicateReview excludes studies from pools until the admin resolves the duplicate decision.
- Studies with these statuses remain in the database with all their data intact, enabling audit, reversal, and PRISMA counting.
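Query-level enforcement can be illustrated with a minimal in-memory filter. The Python sketch below is only an analogy for the real data-store query; `EXCLUDED_FROM_POOLS`, `stage_pool`, and the dictionary shape of a study are hypothetical.

```python
# Statuses taken from the filter above; everything else is pool-eligible.
EXCLUDED_FROM_POOLS = frozenset({
    "Duplicate",
    "Merged",
    "PendingDuplicateReview",
    "PendingDedupCheck",
    "RemovedByAutomation",
    "RemovedOther",
})


def stage_pool(studies):
    """Return the studies eligible for a stage pool; nothing is deleted."""
    return [s for s in studies if s["lifecycleStatus"] not in EXCLUDED_FROM_POOLS]


all_studies = [
    {"id": "s1", "lifecycleStatus": "Active"},
    {"id": "s2", "lifecycleStatus": "Duplicate"},
    {"id": "s3", "lifecycleStatus": "PendingDedupCheck"},
    {"id": "s4", "lifecycleStatus": "Merged"},
]
pool = stage_pool(all_studies)
```

The source list is left untouched, mirroring the rule that excluded studies stay in the database for audit, reversal, and PRISMA counting.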
Reversal
If an admin later rejects a duplicate decision (via the admin review queue):
1. The study's lifecycleStatus SHALL be restored to its previous value (typically Active).
2. The study SHALL re-enter applicable stage study pools.
3. The DedupAuditLog records the rejection.
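The three reversal steps can be sketched as follows. This Python sketch is illustrative and rests on an assumption this section does not state: that the prior status is captured in a hypothetical `previous_status` field when the study is flagged. The function names and the dictionary-based audit record are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Study:
    study_id: str
    lifecycle_status: str = "Active"
    previous_status: Optional[str] = None  # hypothetical field


def flag_duplicate(study: Study, new_status: str) -> None:
    # Remember the prior status so a later rejection can restore it.
    study.previous_status = study.lifecycle_status
    study.lifecycle_status = new_status


def reject_duplicate(study: Study, audit_log: List[dict], admin_id: str) -> None:
    # Step 1: restore the previous value (falling back to Active).
    study.lifecycle_status = study.previous_status or "Active"
    study.previous_status = None
    # Step 2 needs no code here: pool membership is derived from the status.
    # Step 3: record the rejection; earlier entries are never removed.
    audit_log.append({
        "studyId": study.study_id,
        "decision": "Rejected",
        "decidedBy": admin_id,
    })
```

Because stage pools are derived by filtering on lifecycleStatus, restoring the status is enough for the study to re-enter them.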
13. Cross-References
- Three-Level Data Model: three-level-data-model.md -- Publication, Citation, and Study entity specifications that the dedup service operates on.
- PRISMA Flow Diagram Mapping: prisma-flow-diagram-mapping.md -- Box 3 derivation rules that depend on dedup output.
- ASySD Paper: Hair et al. (2023), BMC Biology 21, 189. PMC10483700
- ASySD R Package: github.com/camaradesuk/ASySD
Requirement Coverage
| Requirement ID | Coverage in This Document |
|---|---|
| DEDUP-01 | Complete: ASySD algorithm specified with blocking rounds, field mappings, C# implementation components, two-stage pipeline architecture (Stage 1 synchronous DOI/PMID, Stage 2 async fuzzy matching via MassTransit) |
| DEDUP-02 | Complete: Two-tier confidence model (AutoConfirmed / ProbableDuplicate) with import-time triggering |
| DEDUP-03 | Complete: Import pipeline creates Citation per citation, links to Publication; field mapping defined |
| DEDUP-04 | Complete: Study lifecycle status model covers Duplicate, PendingDuplicateReview, Merged statuses |
| DEDUP-05 | Complete: Canonical enrichment rules with field-by-field priority table and provenance tracking |
| DEDUP-06 | Complete: Admin review queue with display requirements, admin actions (Confirm, Reject, Defer) |
| DEDUP-07 | Complete: Merge wizard for same-stage (sessions become reconciliation candidates) and cross-stage (link via Publication) scenarios |
| DEDUP-08 | Complete: Duplicate study exclusion from stage pools as system invariant, enforced at query level |
| PRISMA-05 | Complete: PRISMA box 3 derivation rules for duplicates, excluded_automatic, and excluded_other |