Deduplication
When you import citations from multiple bibliographic databases, the same paper often appears more than once. SyRF automatically detects and manages these duplicates, saving your team from tedious manual deduplication and ensuring accurate counts for PRISMA reporting.
How Deduplication Works
Deduplication runs automatically whenever new citations are imported into your project. The system uses a two-step process:
Step 1: Exact Matching
First, SyRF checks whether the new citation's DOI or PubMed ID already exists in the system. If an exact match is found, the citation is linked to the existing record immediately. This is fast, reliable, and handles the majority of duplicates from well-structured databases.
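The exact-match step can be pictured as a lookup keyed on normalised identifiers. The sketch below is illustrative only: `normalize_doi`, `ExactMatchIndex`, and the normalisation rules (lower-casing, stripping URL prefixes) are assumptions made for the example, not SyRF's actual implementation.

```python
from typing import Optional

# Prefixes commonly found in exported DOIs (assumed list, for illustration).
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip a leading URL or 'doi:' prefix."""
    if not doi:
        return None
    value = doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if value.startswith(prefix):
            value = value[len(prefix):]
            break
    return value.strip() or None


class ExactMatchIndex:
    """Hypothetical in-memory index from identifiers to existing record ids."""

    def __init__(self) -> None:
        self._by_doi: dict[str, str] = {}
        self._by_pmid: dict[str, str] = {}

    def add(self, record_id: str, doi: Optional[str] = None,
            pmid: Optional[str] = None) -> None:
        key = normalize_doi(doi)
        if key:
            self._by_doi.setdefault(key, record_id)
        if pmid and pmid.strip():
            self._by_pmid.setdefault(pmid.strip(), record_id)

    def find(self, doi: Optional[str] = None,
             pmid: Optional[str] = None) -> Optional[str]:
        """Return the id of an existing record sharing the DOI or PubMed ID."""
        key = normalize_doi(doi)
        if key and key in self._by_doi:
            return self._by_doi[key]
        if pmid and pmid.strip() in self._by_pmid:
            return self._by_pmid[pmid.strip()]
        return None
```

A citation for which `find` returns `None` would then fall through to the fuzzy-matching step.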
Step 2: Fuzzy Matching (ASySD Algorithm)
For citations without exact identifiers -- or where identifiers differ between databases -- SyRF uses the ASySD algorithm (Automated Systematic Search Deduplicator). ASySD was developed by CAMARADES and is published in BMC Biology. It compares bibliographic fields, including title, authors, year, journal, abstract, pages, volume, and issue, across four rounds of progressively broader comparisons.
The algorithm classifies each potential duplicate pair by confidence:
| Confidence Level | What Happens |
|---|---|
| High confidence | The duplicate is confirmed automatically. The duplicate study is removed from screening pools and its import data is preserved. |
| Uncertain | The pair is flagged for your review in the Duplicate Review queue. |
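To make the two confidence levels concrete, here is a simplified sketch of score-based classification. It does not reproduce ASySD's four comparison rounds: the field weights, the thresholds `AUTO_CONFIRM` and `NEEDS_REVIEW`, and the use of `difflib` are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen for illustration only.
AUTO_CONFIRM = 0.95
NEEDS_REVIEW = 0.80


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify_pair(first: dict, second: dict) -> str:
    """Return 'duplicate', 'review', or 'distinct' for a candidate pair."""
    title = similarity(first.get("title", ""), second.get("title", ""))
    authors = similarity(first.get("authors", ""), second.get("authors", ""))
    same_year = first.get("year") == second.get("year")

    # Assumed weighting: titles count for more than author strings.
    score = 0.7 * title + 0.3 * authors

    if score >= AUTO_CONFIRM and same_year:
        return "duplicate"  # high confidence: confirmed automatically
    if score >= NEEDS_REVIEW:
        return "review"     # uncertain: sent to the Duplicate Review queue
    return "distinct"
```

In this sketch, a pair that matches closely on title and authors but differs in year lands in the review queue rather than being confirmed automatically.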
Reviewing Flagged Duplicates
Navigate to Project > Duplicate Review to see pairs the system is uncertain about. Each entry shows two studies the system thinks might be the same paper.
What You See
For each flagged pair, you see:
- Title of both studies
- Authors of both studies
- Year, journal, volume, issue, and pages
- DOI and PubMed ID (if available)
- Confidence score -- how certain the system is that these are duplicates
- Source -- which database each citation came from (e.g., "PubMed" vs. "Embase")
Making a Decision
For each pair, choose one of two actions:
- Confirm Duplicate: The studies are the same paper. SyRF merges them into a single canonical study.
- Not a Duplicate: The studies are different papers. SyRF keeps both studies and removes the pair from the review queue.
Your decision is recorded permanently. You can review past decisions in the audit trail.
Tips for Reviewing
- Focus on titles and authors first. If both match closely, the studies are almost certainly duplicates.
- Check DOIs when available. If both studies have DOIs and they differ, they are probably not duplicates (unless one DOI is incorrect).
- Be cautious with conference abstracts. The same research may appear as both a conference abstract and a full paper. These are typically different records even though they describe the same study.
What Happens When Studies Are Merged
When two studies are confirmed as duplicates -- whether automatically or through your review -- the system creates a single canonical study with the best metadata from both records.
Metadata Enrichment
The canonical study receives the best metadata available from all import sources:
| Field | Rule |
|---|---|
| Title | The longest version is preferred (may include subtitle) |
| Authors | The most complete author list is preferred |
| Abstract | A non-empty abstract is preferred over an empty one |
| DOI | A PubMed-sourced DOI is preferred (most reliably formatted) |
The system tracks which import source provided each canonical value, maintaining full auditability.
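The field rules and source tracking above can be sketched as follows. The function `pick_canonical`, the record layout (dictionaries with a `source` key and authors stored as a list), and the tie-breaking behaviour are assumptions for the example rather than SyRF's actual code.

```python
def pick_canonical(records: list[dict]) -> dict:
    """Choose canonical field values and note which source supplied each."""

    def best(field, key):
        # Only records with a non-empty value for the field are candidates.
        candidates = [r for r in records if r.get(field)]
        return max(candidates, key=key) if candidates else None

    chosen = {
        # Longest title wins (it may include a subtitle).
        "title": best("title", lambda r: len(r["title"])),
        # Assumed measure of completeness: the number of listed authors.
        "authors": best("authors", lambda r: len(r["authors"])),
        # Any non-empty abstract beats an empty one; the first found is kept.
        "abstract": best("abstract", lambda r: 1),
        # Prefer a PubMed-sourced DOI; otherwise fall back to any DOI.
        "doi": best("doi", lambda r: r.get("source") == "PubMed"),
    }

    values, provenance = {}, {}
    for field, record in chosen.items():
        values[field] = record[field] if record else None
        provenance[field] = record.get("source") if record else None
    return {"values": values, "provenance": provenance}
```

Returning the provenance alongside the values mirrors the auditability requirement: each canonical field can be traced back to the import that supplied it.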
Preserving Your Work
- Import records are never deleted. The raw data from each import source is preserved exactly as it was uploaded.
- Annotations are preserved. If either study had annotations, those annotations are kept and become candidates for reconciliation on the canonical study.
- Screening history is preserved. If either study was screened, the screening decisions are maintained on the canonical study.
- Merged duplicates are excluded from pools. The duplicate study remains in the database for audit purposes but never appears in screening or annotation queues.
The Safety Guarantee
SyRF never automatically merges studies that have already been reviewed (annotated or screened). When duplicate citations have been independently reviewed, an administrator must explicitly confirm the merge. This protects your team's existing work.
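As a rough illustration of this guard, the sketch below blocks an automatic merge whenever review work exists. `StudyState` and `merge_action` are hypothetical names, and treating a pair as protected when either study has been reviewed is an assumption about how the rule is applied.

```python
from dataclasses import dataclass


@dataclass
class StudyState:
    """Hypothetical summary of the review work attached to a study."""
    annotation_count: int = 0
    screening_decision_count: int = 0

    @property
    def reviewed(self) -> bool:
        return self.annotation_count > 0 or self.screening_decision_count > 0


def merge_action(first: StudyState, second: StudyState,
                 confirmed_by_admin: bool = False) -> str:
    """Return 'merge' or 'await_admin' for a high-confidence duplicate pair."""
    # Assumption: review work on either study blocks the automatic merge.
    if (first.reviewed or second.reviewed) and not confirmed_by_admin:
        return "await_admin"
    return "merge"
```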
Understanding the Three-Level Model
Deduplication introduces a fundamental change to how SyRF manages bibliographic data. To understand what happens with your imports, it helps to understand the three levels:
Level 1: Your Import (Citation)
When you upload a search file, each entry in it creates a Citation. This is the exact data from your file -- title, authors, year, journal, and all other fields, preserved exactly as imported. Import records are immutable: they are never modified after creation. This ensures you always have a faithful record of what came from each source.
Level 2: The Study
The Study is the working copy that your team annotates and screens. It starts as a copy of the import data but can be enriched with better metadata from other sources (when duplicates are detected). Each study belongs to your project and carries all your annotations, screening decisions, and reconciliation records.
Level 3: The Publication
The Publication is a global bibliographic record shared across the entire SyRF system. If two different projects import the same paper, they link to the same Publication. This enables cross-project deduplication and ensures bibliographic data is consistent across the platform.
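One way to picture the three levels is as three record types. The classes below are an illustrative sketch: the field names are assumptions, and a frozen dataclass is used only to mirror the immutability of import records.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Level 1: immutable record of one imported entry."""
    source: str
    title: str
    doi: Optional[str] = None


@dataclass(frozen=True)
class Publication:
    """Level 3: global bibliographic identity shared across projects."""
    publication_id: str


@dataclass
class Study:
    """Level 2: project-scoped working copy that the team screens."""
    project_id: str
    title: str
    publication: Publication
    citations: list[Citation] = field(default_factory=list)


# Two projects importing the same paper link to one shared Publication.
shared = Publication("pub-1")
from_pubmed = Citation("PubMed", "Aspirin in mice", "10.1/a")
from_embase = Citation("Embase", "Aspirin in mice.")
study_a = Study("project-a", from_pubmed.title, shared, [from_pubmed, from_embase])
study_b = Study("project-b", from_pubmed.title, shared, [from_pubmed])
```

Here each project keeps its own `Study`, so review data stays independent, while both point at the same `Publication`.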
Why This Matters
- Better metadata: When the same paper is imported from multiple databases, the canonical study receives the best fields from each source.
- Accurate PRISMA counts: The three-level model makes it straightforward to count total records imported, duplicates removed, and unique studies screened -- the exact numbers the PRISMA flow diagram requires.
- Cross-project awareness: If a paper has already been reviewed in another project, the Publication record carries that information (though each project's review data remains independent).
```mermaid
flowchart TD
    subgraph "Your Imports"
        IR1["PubMed citation"]
        IR2["Embase citation"]
        IR3["Scopus citation"]
    end
    subgraph "Your Project"
        S["Canonical Study<br/>(best metadata)"]
    end
    subgraph "Global"
        R["Publication<br/>(shared identity)"]
    end
    IR1 --> S
    IR2 --> S
    IR3 --> S
    S --> R
    style S fill:#e8f5e9
    style R fill:#e1f5fe
```
Deduplication and PRISMA
The PRISMA 2020 flow diagram requires accurate counts of:
- Records identified from databases/registers: The total number of Citations, grouped by source type
- Duplicates removed: The number of studies with Duplicate or Merged status
- Records screened: The number of unique, active studies after deduplication
These counts are computed automatically from the three-level model. You do not need to calculate them manually. See Data Export and PRISMA for details on generating the PRISMA flow diagram.
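The counting logic can be sketched roughly as below. The `Duplicate` and `Merged` status names come from the list above; the `Active` status value, the dictionary layout, and the function name `prisma_counts` are assumptions for illustration.

```python
from collections import Counter


def prisma_counts(citations: list[dict], studies: list[dict]) -> dict:
    """Derive PRISMA flow-diagram counts from import records and studies."""
    # Records identified: every imported Citation, grouped by source.
    identified = Counter(c["source"] for c in citations)
    # Duplicates removed: studies marked Duplicate or Merged.
    duplicates = sum(1 for s in studies if s["status"] in {"Duplicate", "Merged"})
    # Records screened: studies still active after deduplication (assumed status).
    screened = sum(1 for s in studies if s["status"] == "Active")
    return {
        "records_identified": dict(identified),
        "records_identified_total": sum(identified.values()),
        "duplicates_removed": duplicates,
        "records_screened": screened,
    }
```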
Related
- Data Export and PRISMA -- how deduplication data feeds the PRISMA diagram
- Screening Profiles -- deduplication runs before screening
- Feature Brief
- Deduplication Specification
- Platform Architecture