
Three-Level Data Model Specification: Publication / Citation / Study

Purpose

This document specifies the data model that separates bibliographic identity from import evidence from review data, enabling PRISMA-compliant counting and cross-project deduplication. It is a binding constraint on all data model decisions in Phases 3-16.

The three-level model directly addresses the PRISMA 2020 distinction between records, reports, and studies (see prisma-flow-diagram-mapping.md Section 2).

Normative language: "MUST" indicates an absolute requirement. "SHALL" indicates mandatory behavior. "SHOULD" indicates a strong recommendation. "MAY" indicates optional behavior.

1. Model Overview

System Scope                               Project Scope
+------------------------+                 +--------------------------+
|     pmPublication      |                 |         pmStudy          |
|                        |                 |                          |
|  _id                   |                 |  _id                     |
|  doi (unique*)         |  1        many  |  projectId               |
|  pmid (unique*)        |<---------+------+  publicationId           |
|  canonicalTitle        |          |      |  lifecycleStatus         |
|  canonicalAuthors      |          |      |  fullTextStatus          |
|  ...                   |          |      |  duplicateGroupId        |
|  metadataProvenance[]  |          |      |  screeningOutcomes[]     |
|  linkedProjectIds[]    |          |      |                          |
+------------------------+          |      |  citations[]:            |
                                    |      |  +--------------------+  |
                                    |      |  | Citation           |  |
                                    +------+--+ publicationId      |  |
                                           |  | systematicSearchId |  |
                                           |  | sourceType         |  |
                                           |  | sourceName         |  |
                                           |  | rawTitle           |  |
                                           |  | rawAuthors         |  |
                                           |  | rawAbstract        |  |
                                           |  | rawDoi             |  |
                                           |  | ...                |  |
                                           |  | importedAt         |  |
                                           |  +--------------------+  |
                                           +--------------------------+

* Unique among documents where the field is present (unique, sparse index; see Section 2).

Relationship Summary

| Relationship | Cardinality | Scope Crossing | Description |
|---|---|---|---|
| Publication ← Citation | 1 : many | System ← Project | Many Citations (across many projects) link to one Publication |
| Study → Publication | many : 1 | Project → System | Each Study has one canonical Publication |
| Study --contains→ Citation[] | 1 : many | Project-internal | A Study embeds all its Citations |
| Publication → Project[] | 1 : many | System → Project | A Publication tracks which projects reference it |

Key Invariants

  1. A Publication is system-scoped -- it exists independently of any project and accumulates metadata from all linked Citations across all projects.
  2. A Citation is project-scoped and immutable -- it preserves the exact citation as imported from a specific source, enabling per-source PRISMA counting.
  3. A Study is project-scoped and mutable -- it is the reviewable entity that annotators, screeners, and reconcilers interact with.
  4. Deleting a Study does NOT delete the Publication (system-scoped entities are never deleted by project operations).
  5. Multiple Citations may link to the same Study after deduplication confirms they represent the same research.

2. Publication Entity Specification

Collection

Name: pmPublication (new collection, system-scoped)

Prefix rationale: The pm prefix follows the existing Project Management bounded context convention (see MongoContext.GetCollection() at MongoContext.cs:158-167). Although Publications are system-scoped, they are managed within the Project Management domain.

Fields

| Field | Type | Nullable | Description |
|---|---|---|---|
| _id | Guid (CSUUID) | No | Primary identifier |
| doi | string | Yes | Digital Object Identifier (normalized to lowercase, without "https://doi.org/" prefix) |
| pmid | string | Yes | PubMed identifier |
| canonicalTitle | string | Yes | Best-of-breed title from all linked Citations |
| canonicalAuthors | Author[] | Yes | Best-of-breed author list (prefer most complete list) |
| canonicalAbstract | string | Yes | Best-of-breed abstract (prefer non-empty) |
| canonicalYear | int? | Yes | Best-of-breed publication year |
| canonicalJournal | string | Yes | Best-of-breed journal name (prefer full name over abbreviation) |
| canonicalVolume | string | Yes | Best-of-breed volume |
| canonicalPages | string | Yes | Best-of-breed page range |
| canonicalIssue | string | Yes | Best-of-breed issue number |
| canonicalIsbn | string | Yes | Best-of-breed ISBN/ISSN |
| metadataProvenance | MetadataProvenance[] | Yes | Field-level tracking of which Citation provided each canonical value |
| linkedProjectIds | Guid[] | No | Projects that have Citations referencing this Publication (denormalized for query efficiency) |
| createdAt | DateTime | No | Timestamp of Publication creation |
| updatedAt | DateTime | No | Timestamp of last metadata update |
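The doi normalization described in the field list can be sketched as below. This is an illustrative Python sketch, not the project's C# code: the helper name normalize_doi is hypothetical, and stripping the "http://doi.org/" and "doi:" prefixes is an assumption that goes beyond the single prefix the specification names.

```python
from typing import Optional

# Assumption: only "https://doi.org/" is named in the specification;
# the other two prefixes are added here for illustration.
_DOI_PREFIXES = ("https://doi.org/", "http://doi.org/", "doi:")


def normalize_doi(raw_doi: Optional[str]) -> Optional[str]:
    """Return the lowercase DOI without a resolver prefix, or None if empty."""
    if raw_doi is None:
        return None
    doi = raw_doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # An empty result is reported as None so the field can be omitted.
    return doi or None
```

The normalized value is what would be stored in Publication.doi; the Citation keeps the untouched original in rawDoi.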

MetadataProvenance Embedded Document

| Field | Type | Description |
|---|---|---|
| fieldName | string | Name of the canonical field (e.g., "canonicalTitle", "canonicalAbstract") |
| sourceCitationId | Guid | Citation that provided this field value |
| sourceProjectId | Guid | Project that owns the source Citation |
| updatedAt | DateTime | When this field was last updated from this source |

Indexes

| Index | Fields | Type | Purpose |
|---|---|---|---|
| ix_doi | doi | Unique, sparse | Fast lookup by DOI; enforces DOI uniqueness across system |
| ix_pmid | pmid | Unique, sparse | Fast lookup by PMID; enforces PMID uniqueness across system |
| ix_linkedProjectIds | linkedProjectIds | Regular | Find all Publications used by a project |

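Sparse index note: A unique, sparse index applies the uniqueness constraint only to documents in which the field is present. This is critical because many citations lack DOIs or PMIDs. A MongoDB sparse index still indexes a field that is explicitly set to null, so a Publication without a DOI or PMID MUST omit the field from the stored document rather than persist it as null; otherwise a second such document would violate the unique constraint.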

Behavioral Rules

  1. Creation: A Publication SHALL be created when a Citation is imported and no existing Publication matches by DOI, PMID, or the ASySD dedup algorithm.
  2. Update: A Publication SHALL be updated when a new Citation (from any project) provides better metadata for any canonical field. "Better" is defined by the canonical enrichment rules (see Section 6).
  3. Deletion: A Publication SHALL NEVER be deleted, even if all Citations referencing it are removed from all projects.
  4. Cross-project: Multiple projects MAY reference the same Publication. The linkedProjectIds array MUST be updated when a new project creates a Citation linking to this Publication.
  5. GUID representation: All Guid fields MUST use CSUUID (C# Legacy GUID format, BinData subtype 3) as configured in MongoUtils.cs.
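The creation and cross-project rules above can be sketched with an in-memory registry. This is an illustrative Python sketch under stated assumptions, not the project's C# code: PublicationRegistry and resolve are hypothetical names, the dictionaries stand in for the ix_doi and ix_pmid indexes, and the ASySD fuzzy-matching path and canonical enrichment are deliberately left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Publication:
    id: uuid.UUID
    doi: Optional[str] = None
    pmid: Optional[str] = None
    linked_project_ids: Set[uuid.UUID] = field(default_factory=set)


class PublicationRegistry:
    """In-memory stand-in for the pmPublication collection (illustrative only)."""

    def __init__(self) -> None:
        self._by_doi: Dict[str, Publication] = {}
        self._by_pmid: Dict[str, Publication] = {}

    def resolve(self, project_id: uuid.UUID, doi: Optional[str] = None,
                pmid: Optional[str] = None) -> Publication:
        # Rule 1: reuse an existing Publication on an exact DOI or PMID match.
        publication = None
        if doi:
            publication = self._by_doi.get(doi)
        if publication is None and pmid:
            publication = self._by_pmid.get(pmid)
        if publication is None:
            # No match: create a new system-scoped Publication.
            publication = Publication(id=uuid.uuid4(), doi=doi, pmid=pmid)
            if doi:
                self._by_doi[doi] = publication
            if pmid:
                self._by_pmid[pmid] = publication
        # Rule 4: record the referencing project (a set avoids duplicates).
        publication.linked_project_ids.add(project_id)
        return publication
```

Two projects importing the same DOI receive the same Publication, and both project IDs end up in linked_project_ids. There is no delete method, which mirrors rule 3.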

3. Citation Value Object Specification

Storage

Embedded on: Study document as citations[] array.

Design decision: Citations are embedded on the Study document rather than stored in a separate collection. This is recommended for MVP because:

  • A Study's Citations are always accessed together with the Study.
  • The expected cardinality is low (typically 1-5 Citations per Study after dedup).
  • Embedding avoids cross-collection joins for the common read path.
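  • If Study documents approach the 16 MB MongoDB document size limit, Citations can be extracted to a separate pmCitation collection in a future phase. The existing publicationId field, together with a studyId field added during extraction (an embedded Citation does not carry one; it is implied by the containing Study), provides the join keys needed for this migration.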

Fields

| Field | Type | Nullable | Description |
|---|---|---|---|
| _id | Guid (CSUUID) | No | Unique identifier for this citation |
| publicationId | Guid (CSUUID) | No | Links to pmPublication._id -- the system-scoped bibliographic identity |
| projectId | Guid (CSUUID) | No | Project that owns this citation |
| systematicSearchId | Guid (CSUUID) | No | Links to the SystematicSearch that imported this record |
| sourceType | SearchSourceType | No | Classification of the import source (Database, Register, Website, Organisation, CitationSearching, Other) |
| sourceName | string | No | Human-readable source name (e.g., "PubMed", "Embase", "ClinicalTrials.gov") |
| rawTitle | string | Yes | Title as imported (never modified) |
| rawAuthors | string | Yes | Authors as imported (never modified) |
| rawAbstract | string | Yes | Abstract as imported (never modified) |
| rawYear | string | Yes | Publication year as imported (never modified; string to preserve original format) |
| rawDoi | string | Yes | DOI as imported (never modified) |
| rawJournal | string | Yes | Journal name as imported (never modified) |
| rawVolume | string | Yes | Volume as imported (never modified) |
| rawPages | string | Yes | Pages as imported (never modified) |
| rawIssue | string | Yes | Issue number as imported (never modified) |
| rawIsbn | string | Yes | ISBN/ISSN as imported (never modified) |
| importedAt | DateTime | No | Timestamp of import operation |
| referenceFileId | Guid (CSUUID) | Yes | Links to specific file within the systematic search |

Behavioral Rules

  1. Immutability: A Citation SHALL NEVER be modified after creation. This is a PRISMA requirement: the original import data MUST be preserved to enable per-source counting (derivation rules for PRISMA boxes 2, 11 depend on this).
  2. Creation: One Citation SHALL be created per imported record per import operation. If the same record appears in multiple files within a search, each instance creates a separate Citation.
  3. Dedup linking: Multiple Citations MAY link to the same Study after deduplication confirms they represent the same research investigation.
  4. Source type: The sourceType field SHALL be populated from the SystematicSearch.sourceType that triggered the import. This enables PRISMA Column 1 vs. Column 2 assignment.
  5. Raw field prefix: All raw bibliographic fields are prefixed with raw to distinguish them from the canonical (best-of-breed) fields on the Publication entity. Raw fields SHALL NEVER be modified.
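The immutability rule can be illustrated with a frozen value object. This is an illustrative Python sketch, not the project's C# code: the class carries only a subset of the Citation fields, and modelling sourceType as a plain string is a simplification.

```python
import dataclasses
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Subset of the Citation fields; frozen so raw values cannot be changed."""
    id: uuid.UUID
    publication_id: uuid.UUID
    project_id: uuid.UUID
    systematic_search_id: uuid.UUID
    source_type: str
    source_name: str
    imported_at: datetime
    raw_title: Optional[str] = None
    raw_doi: Optional[str] = None


citation = Citation(
    id=uuid.uuid4(),
    publication_id=uuid.uuid4(),
    project_id=uuid.uuid4(),
    systematic_search_id=uuid.uuid4(),
    source_type="Database",
    source_name="PubMed",
    imported_at=datetime.now(timezone.utc),
    raw_title="  A Title, exactly as Imported ",
)

try:
    citation.raw_title = "A cleaned-up title"
except dataclasses.FrozenInstanceError:
    # Any attempt to edit an imported value is rejected.
    print("Citation is immutable")
```

Cleaned or enriched values belong on the Publication's canonical fields; the Citation keeps the import evidence byte for byte, including stray whitespace.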

4. Study Entity Modifications

Collection

Name: pmStudy (existing collection, project-scoped)

New Fields

The following fields SHALL be added to the existing Study document. All new fields MUST be nullable to ensure backward compatibility with existing documents.

| Field | Type | Nullable | Default | Description |
|---|---|---|---|---|
| lifecycleStatus | StudyLifecycleStatus | Yes | null (treated as Active) | Track study position in the review pipeline |
| duplicateGroupId | Guid? (CSUUID) | Yes | null | Links duplicate studies to a group for dedup tracking |
| publicationId | Guid? (CSUUID) | Yes | null | Links to the pmPublication this study represents |
| citations | Citation[] | Yes | null (treated as empty) | Embedded array of all citations linked to this study |
| fullTextStatus | FullTextStatus | Yes | null (treated as Pending) | Track full-text retrieval progress |
| screeningOutcomes | ScreeningOutcome[] | Yes | null (treated as empty) | Per-profile screening results (specified in Phase 15, placeholder here) |
| metaAnalysisIncluded | bool? | Yes | null | Whether the study is included in quantitative synthesis (meta-analysis) |

StudyLifecycleStatus Enum

StudyLifecycleStatus:
  Active = 0                  // Default: available for screening/review
  Duplicate = 1               // Confirmed duplicate (auto or admin-confirmed)
  PendingDuplicateReview = 2  // Probable duplicate awaiting admin review
  FullTextSought = 3          // Full text retrieval attempted
  FullTextNotRetrieved = 4    // Full text could not be obtained
  Included = 5                // Final: included in review
  Merged = 6                  // Merged into another study (duplicate resolution)
  RemovedByAutomation = 7     // Removed by automation tool (PRISMA box 3)
  RemovedOther = 8            // Removed for other pre-screen reasons (PRISMA box 3)
  PendingDedupCheck = 9       // Stage 1 exact-match found no match; awaiting Stage 2 fuzzy matching

Critical design decision: Screening exclusion is NOT a lifecycle status. Screening outcomes are per-profile on the Study (in screeningOutcomes[]). The lifecycle status tracks the study's position in the overall review pipeline, while screening outcomes track per-criteria decisions. This separation is essential because a study can be excluded under one screening profile and included under another in a multi-stage pipeline. See prisma-flow-diagram-mapping.md Section 3 for how lifecycle status maps to PRISMA boxes.

FullTextStatus Enum

FullTextStatus:
  Pending = 0         // Not yet sought (default for studies that pass title/abstract screening)
  Sought = 1          // Full text retrieval has been attempted
  Retrieved = 2       // Full text has been obtained
  NotRetrieved = 3    // Full text could not be obtained (PRISMA boxes 7, 13)
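Because every new Study field is nullable, readers have to apply the "treated as" defaults when they load documents written before the migration. The sketch below shows one way to do that. It is illustrative Python, not the project's C# code: the accessor names are hypothetical and a Study is modelled as a plain mapping.

```python
from enum import IntEnum
from typing import Any, List, Mapping


class StudyLifecycleStatus(IntEnum):
    ACTIVE = 0
    DUPLICATE = 1
    PENDING_DUPLICATE_REVIEW = 2
    FULL_TEXT_SOUGHT = 3
    FULL_TEXT_NOT_RETRIEVED = 4
    INCLUDED = 5
    MERGED = 6
    REMOVED_BY_AUTOMATION = 7
    REMOVED_OTHER = 8
    PENDING_DEDUP_CHECK = 9


class FullTextStatus(IntEnum):
    PENDING = 0
    SOUGHT = 1
    RETRIEVED = 2
    NOT_RETRIEVED = 3


def effective_lifecycle_status(study: Mapping[str, Any]) -> StudyLifecycleStatus:
    """A missing or null lifecycleStatus is treated as Active."""
    value = study.get("lifecycleStatus")
    return StudyLifecycleStatus.ACTIVE if value is None else StudyLifecycleStatus(value)


def effective_full_text_status(study: Mapping[str, Any]) -> FullTextStatus:
    """A missing or null fullTextStatus is treated as Pending."""
    value = study.get("fullTextStatus")
    return FullTextStatus.PENDING if value is None else FullTextStatus(value)


def effective_citations(study: Mapping[str, Any]) -> List[Any]:
    """A missing or null citations array is treated as empty."""
    return list(study.get("citations") or [])
```

Routing every read through accessors like these keeps the null handling in one place, so a legacy document and a backfilled one behave identically.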

ScreeningOutcome Embedded Document (Placeholder)

This structure is fully specified in Phase 15. The placeholder here establishes the shape for forward compatibility:

| Field | Type | Description |
|---|---|---|
| profileId | Guid | Screening profile that produced this outcome |
| stageId | Guid | Stage where screening occurred |
| finalOutcome | FinalScreeningOutcomeValue | Included / Excluded / Conflict / Pending |
| primaryExclusionReason | string? | Structured reason (for PRISMA box 9/15 reporting) |
| decidedAt | DateTime | When the outcome was determined |
| source | ScreeningAuthoritySource | Reconciled / CandidateAgreement / Admin |

Note: The exact field names and types for ScreeningOutcome may be refined in Phase 15. This placeholder establishes that screening outcomes are per-profile, stored as an array on the Study document, and include structured exclusion reasons.

5. Relationship Rules

Publication ← Citation (System ← Project)

  • One Publication can be linked from Citations in MANY projects.
  • The Publication.linkedProjectIds[] array MUST be updated whenever a Citation in a new project links to this Publication.
  • When a project is deleted, its Citations are removed but the Publication persists. The project's ID SHALL be removed from linkedProjectIds[].

Study → Publication (Project → System)

  • One Study has ONE canonical Publication (via publicationId).
  • Multiple Studies (in different projects) MAY reference the same Publication.
  • The canonical Publication provides the best-of-breed metadata for display.
  • When a Study is created from a Citation, publicationId SHALL be set to the Citation's publicationId.

Study --contains→ Citation[] (Project-internal)

  • One Study has MANY Citations (from the same project, possibly from different systematic searches).
  • After dedup, multiple Citations that represent the same research are linked to the same Study.
  • The Citation count per Study enables the "records identified vs. studies included" distinction in PRISMA.
  • Citations are immutable; Study metadata is mutable (enriched from best-of-breed Publication data).

Cascading Rules

| Operation | Effect on Publication | Effect on Citations | Effect on Study |
|---|---|---|---|
| Delete Study | Remove projectId from linkedProjectIds (if no other Studies in project link to it) | Removed with Study (embedded) | Deleted |
| Delete Project | Remove projectId from all affected Publications' linkedProjectIds | Removed with Studies (embedded) | All project Studies deleted |
| Merge Studies (dedup) | No change (both already link to same Publication) | Citations from secondary Study moved to primary Study | Secondary Study set to Merged status |
| Import new citation | Create or link to existing Publication | Create new Citation on Study | Create new Study (or add Citation to existing Study if dedup match) |
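The "Merge Studies" row can be sketched as follows. This is illustrative Python, not the project's C# code: merge_studies is a hypothetical name, Studies are plain dictionaries, and the same-project check is an assumption drawn from the rule that a Study's Citations come from the same project.

```python
from typing import Any, Dict

MERGED = 6  # StudyLifecycleStatus.Merged


def merge_studies(primary: Dict[str, Any], secondary: Dict[str, Any]) -> None:
    """Move the secondary Study's Citations to the primary and mark the secondary Merged."""
    if primary["projectId"] != secondary["projectId"]:
        raise ValueError("Studies must belong to the same project")
    moved = secondary.get("citations") or []
    # Citations are moved as-is; their contents are never edited.
    primary["citations"] = (primary.get("citations") or []) + moved
    secondary["citations"] = []
    secondary["lifecycleStatus"] = MERGED
```

Moving a Citation changes which Study embeds it but not the Citation itself, so the immutability rule still holds and the project-wide Citation count is unchanged by a merge.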

6. Canonical Enrichment Rules

When multiple Citations link to the same Publication (across projects), the canonical metadata fields on the Publication SHALL be populated using best-of-breed selection:

| Field | Selection Rule | Rationale |
|---|---|---|
| canonicalTitle | Prefer longest non-empty title | Longer titles are typically more complete |
| canonicalAuthors | Prefer longest/most complete author list | More authors = more complete record |
| canonicalAbstract | Prefer non-empty abstract; among non-empty, prefer longest | Abstract presence is critical for screening |
| canonicalYear | Prefer explicit numeric year | Explicit year is more reliable than parsed |
| canonicalJournal | Prefer full journal name over abbreviation | Full name is more informative |
| canonicalVolume | Prefer non-empty | Any volume data is better than none |
| canonicalPages | Prefer complete page range (containing "-") | Complete range is more informative |
| canonicalIssue | Prefer non-empty | Any issue data is better than none |
| canonicalIsbn | Prefer non-empty | Any ISBN/ISSN is better than none |

Provenance tracking: When a canonical field is updated, a MetadataProvenance entry MUST be added or updated to record which Citation provided the value.

ASySD alignment: These rules align with the merge_citations = TRUE behavior in the ASySD R package (Hair et al., 2023).
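A few of the selection rules, together with provenance tracking, can be sketched as below. This is illustrative Python, not the project's C# code and not the ASySD implementation: enrich_publication and the rule helpers are hypothetical names, only four fields are covered, whitespace trimming of the candidate is an assumption, and ties keep the existing value (also an assumption).

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


def _longer(current: Optional[str], candidate: str) -> bool:
    return len(candidate) > len(current or "")


def _fill_if_empty(current: Optional[str], candidate: str) -> bool:
    return not current


def _complete_range(current: Optional[str], candidate: str) -> bool:
    return not current or ("-" in candidate and "-" not in current)


# canonical field -> (raw Citation field, "is the candidate better?" rule)
RULES: Dict[str, Tuple[str, Callable[[Optional[str], str], bool]]] = {
    "canonicalTitle": ("rawTitle", _longer),
    "canonicalAbstract": ("rawAbstract", _longer),
    "canonicalVolume": ("rawVolume", _fill_if_empty),
    "canonicalPages": ("rawPages", _complete_range),
}


def enrich_publication(publication: Dict[str, Any], citation: Dict[str, Any]) -> List[str]:
    """Apply best-of-breed rules; return the canonical fields that changed."""
    updated: List[str] = []
    for canonical, (raw, is_better) in RULES.items():
        candidate = (citation.get(raw) or "").strip()
        if candidate and is_better(publication.get(canonical), candidate):
            publication[canonical] = candidate
            # Replace any earlier provenance entry for this field.
            provenance = [p for p in publication.get("metadataProvenance", [])
                          if p["fieldName"] != canonical]
            provenance.append({"fieldName": canonical,
                               "sourceCitationId": citation["_id"],
                               "sourceProjectId": citation["projectId"]})
            publication["metadataProvenance"] = provenance
            updated.append(canonical)
    return updated
```

The Citation dictionary is only read, never written, so enrichment does not conflict with Citation immutability.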

7. Migration Strategy

The three-level model is introduced incrementally across three phases to minimize migration risk.

Phase 7 (Release 1): Forward-Compatible Schema

| Change | Entity | Migration Type | Rollback |
|---|---|---|---|
| Add nullable sourceType field | SystematicSearch | Additive | $unset: { sourceType: "" } |
| Add nullable sourceName field | SystematicSearch | Additive | $unset: { sourceName: "" } |
| Verify Study schema supports additive fields | Study | Validation only | N/A |

Validation: The Study document schema MUST accept the future fields (lifecycleStatus, citations, publicationId, duplicateGroupId, fullTextStatus, screeningOutcomes, metaAnalysisIncluded) without breaking existing code. Since MongoDB is schemaless, this means verifying that no code path assumes these fields do NOT exist.

Phase 12 (Release 3): Core Three-Level Model

| Change | Entity | Migration Type | Rollback |
|---|---|---|---|
| Create pmPublication collection with indexes | New collection | Additive | Drop collection |
| Add citations[] to Study | Study | Additive (nullable) | $unset: { citations: "" } |
| Add publicationId to Study | Study | Additive (nullable) | $unset: { publicationId: "" } |
| Add lifecycleStatus to Study | Study | Additive (nullable) | $unset: { lifecycleStatus: "" } |
| Add duplicateGroupId to Study | Study | Additive (nullable) | $unset: { duplicateGroupId: "" } |
| Add fullTextStatus to Study | Study | Additive (nullable) | $unset: { fullTextStatus: "" } |
| Backfill: Create Publication per unique DOI/PMID from existing Studies | pmPublication | Backfill | Delete created Publications |
| Backfill: Create Citation per existing Study from SystematicSearch data | Study | Backfill | $unset: { citations: "" } |
| Backfill: Set publicationId where Publication was created | Study | Backfill | $unset: { publicationId: "" } |

Phase 16 (Release 3): Full PRISMA Support

| Change | Entity | Migration Type | Rollback |
|---|---|---|---|
| Backfill lifecycleStatus = Active on all existing studies | Study | Backfill | $unset: { lifecycleStatus: "" } |
| Populate sourceType on SystematicSearches where determinable from LibraryFileType | SystematicSearch | Backfill | $unset: { sourceType: "" } |
| Add screeningOutcomes[] to Study | Study | Additive (nullable) | $unset: { screeningOutcomes: "" } |
| Add metaAnalysisIncluded to Study | Study | Additive (nullable) | $unset: { metaAnalysisIncluded: "" } |

Source type inference rules for backfill:

| LibraryFileType | Inferred sourceType | Confidence |
|---|---|---|
| PubmedXml | Database | High |
| EndnoteXml | Unknown (Endnote exports from any source) | Cannot infer |
| TsvLibrary | Unknown | Cannot infer |
| CsvLibrary | Unknown | Cannot infer |
| LivingSearchJson | Unknown (depends on configured source) | Cannot infer |

For ambiguous cases, sourceType SHALL remain null and an admin interface SHALL allow manual classification.
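The inference table reduces to a lookup with a null fallback. The sketch below is illustrative Python, not the project's C# code; infer_source_type is a hypothetical name and the enum values are represented as strings.

```python
from typing import Optional

# Only file types whose source can be inferred with high confidence appear here.
_INFERRED_SOURCE_TYPE = {"PubmedXml": "Database"}


def infer_source_type(library_file_type: str) -> Optional[str]:
    """Return the inferred sourceType, or None when it cannot be inferred."""
    return _INFERRED_SOURCE_TYPE.get(library_file_type)
```

A None result leaves sourceType null on the SystematicSearch, which is the state the admin interface is expected to resolve manually.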

8. PRISMA Counting Implications

Each level of the model enables specific PRISMA counts. This section maps the entity structure to PRISMA reporting capabilities.

Citation Level -- "Records Identified"

Citations enable per-source-type counting because each preserves its sourceType and sourceName immutably:

| PRISMA Box | Count | Data Path |
|---|---|---|
| Box 2: Records from databases | COUNT(citations) WHERE sourceType = Database | Study.citations[].sourceType |
| Box 2: Records from registers | COUNT(citations) WHERE sourceType = Register | Study.citations[].sourceType |
| Box 11: Records from websites | COUNT(citations) WHERE sourceType = Website | Study.citations[].sourceType |
| Box 11: Records from organisations | COUNT(citations) WHERE sourceType = Organisation | Study.citations[].sourceType |
| Box 11: Records from citation searching | COUNT(citations) WHERE sourceType = CitationSearching | Study.citations[].sourceType |
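The per-source counts can be sketched as a single pass over the embedded arrays. This is illustrative Python, not the project's C# code or a MongoDB aggregation: count_citations_by_source_type is a hypothetical name and Studies are plain dictionaries.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def count_citations_by_source_type(studies: Iterable[Dict[str, Any]]) -> Counter:
    """Count embedded Citations per sourceType across a project's Studies."""
    counts: Counter = Counter()
    for study in studies:
        # A missing or null citations array is treated as empty.
        for citation in study.get("citations") or []:
            counts[citation["sourceType"]] += 1
    return counts
```

Because the count is taken over Citations rather than Studies, a Study that absorbed three records from two databases and a register still contributes three records to the identification boxes.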

Citation vs. Study Count -- "Duplicates Removed"

The difference between Citation count and unique Study count after dedup gives the duplicate removal count:

| PRISMA Box | Count | Derivation |
|---|---|---|
| Box 3: Duplicates removed | COUNT(all Citations across project) - COUNT(unique Studies WHERE lifecycleStatus NOT IN (Duplicate, Merged)) | Citation preservation + lifecycle status |
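The Box 3 derivation can be checked with a small worked example. This is illustrative Python, not the project's C# code: duplicates_removed is a hypothetical name, and it counts every embedded Citation in the project against the Studies that are neither Duplicate nor Merged.

```python
from typing import Any, Dict, Iterable

DUPLICATE, MERGED = 1, 6  # StudyLifecycleStatus values


def duplicates_removed(studies: Iterable[Dict[str, Any]]) -> int:
    """Total Citations minus Studies that survive deduplication."""
    studies = list(studies)
    total_citations = sum(len(s.get("citations") or []) for s in studies)
    surviving = sum(1 for s in studies
                    if s.get("lifecycleStatus") not in (DUPLICATE, MERGED))
    return total_citations - surviving
```

For example, five imported records that collapse into two surviving Studies yield three duplicates removed, regardless of whether each duplicate was merged into a primary Study or flagged in place.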

Study.lifecycleStatus -- Terminal State Boxes

The lifecycle status directly populates PRISMA terminal state boxes:

| PRISMA Box | lifecycleStatus | Count |
|---|---|---|
| Box 3: Excluded by automation | RemovedByAutomation | COUNT(studies WHERE lifecycleStatus = RemovedByAutomation) |
| Box 3: Excluded other | RemovedOther | COUNT(studies WHERE lifecycleStatus = RemovedOther) |
| Box 7/13: Reports not retrieved | FullTextNotRetrieved | COUNT(studies WHERE lifecycleStatus = FullTextNotRetrieved OR fullTextStatus = NotRetrieved) |
| Box 10/16: Studies included | Included | COUNT(studies WHERE lifecycleStatus = Included) |

Study.screeningOutcomes[] -- Screening Boxes

Per-profile screening outcomes enable the screening-related PRISMA boxes:

| PRISMA Box | Data Path | Count |
|---|---|---|
| Box 4: Records screened | Stage pool membership | Studies that entered T/A screening pool |
| Box 5: Records excluded (T/A) | screeningOutcomes[profileId=TA].finalOutcome = Excluded | Per-profile exclusion count |
| Box 8: Reports assessed | Stage pool membership | Studies that entered FT screening pool |
| Box 9/15: Reports excluded with reasons | screeningOutcomes[profileId=FT].primaryExclusionReason | Grouped by reason |
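Grouping full-text exclusions by reason can be sketched as below. This is illustrative Python, not the project's C# code, and it relies on the placeholder ScreeningOutcome shape, which may change in Phase 15. The function name, the string form of finalOutcome, and the "Unspecified" bucket for missing reasons are assumptions.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def exclusion_reasons(studies: Iterable[Dict[str, Any]], profile_id: str) -> Counter:
    """Count Excluded outcomes for one screening profile, grouped by reason."""
    reasons: Counter = Counter()
    for study in studies:
        for outcome in study.get("screeningOutcomes") or []:
            if (outcome["profileId"] == profile_id
                    and outcome["finalOutcome"] == "Excluded"):
                # Assumption: outcomes without a reason go to one shared bucket.
                reasons[outcome.get("primaryExclusionReason") or "Unspecified"] += 1
    return reasons
```

Filtering on profileId is what lets the same Study be counted as excluded under one profile and included under another.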

9. Open Questions

Q1: Citation Storage Model (Embedded vs. Separate Collection)

Current recommendation: Embedded on Study (citations[] array).

Rationale: Low cardinality (1-5 per Study typically), always accessed with Study, avoids cross-collection joins.

Risk: If a Study accumulates many Citations (e.g., a highly cited paper imported from 50+ searches), the embedded array could push the document toward the 16 MB MongoDB limit. However, this is an extreme edge case.

Migration path: If embedded storage proves insufficient:

  1. Create pmCitation collection with studyId and publicationId indexes.
  2. Move Citations from Study.citations[] to the new collection.
  3. Update all queries to use the collection instead of the embedded array.

This is a non-breaking change (additive collection + removal of embedded field).
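The move from embedded array to standalone documents can be sketched as a pure transformation. This is illustrative Python, not the project's C# code: extract_citations is a hypothetical name, and adding studyId at extraction time is an assumption, since the embedded Citation fields listed in Section 3 do not include it.

```python
from typing import Any, Dict, List


def extract_citations(study: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build standalone pmCitation documents for one Study, adding the studyId join key."""
    return [{**citation, "studyId": study["_id"]}
            for citation in study.get("citations") or []]
```

Each output document is a copy, so the embedded originals stay untouched until the new collection has been verified and the embedded field is removed.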

Q2: Living Search Incremental Dedup Behavior ✅ Resolved

Context: SyRF supports living searches that periodically add new records. When new records arrive, they MUST be deduplicated against existing project records.

Resolution: Two-stage pipeline:

  1. Stage 1 (synchronous): DOI/PMID exact match against pmPublication resolves the majority of well-sourced imports immediately. Unmatched records are created with lifecycleStatus = PendingDedupCheck.
  2. Stage 2 (asynchronous): The DedupStage2Consumer runs the full C# fuzzy matching pipeline (blocking, Jaro-Winkler scoring, classification, grouping) over all PendingDedupCheck studies plus existing Active studies that lack a publicationId. Only studies without an authoritative identifier need to be included — if the new record were a duplicate of an identified study, Stage 1 would have caught it. After grouping, only groups containing at least one PendingDedupCheck record are acted on.

Resolved in: deduplication-service-specification.md Section 3.5.
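The Stage 2 selection and the "act only on groups with a pending record" filter can be sketched as below. This is illustrative Python, not the DedupStage2Consumer itself: both function names are hypothetical, Studies are plain dictionaries, and the blocking, scoring, and grouping steps are out of scope.

```python
from typing import Any, Dict, Iterable, List

ACTIVE, PENDING_DEDUP_CHECK = 0, 9  # StudyLifecycleStatus values


def _status(study: Dict[str, Any]) -> int:
    # A missing or null lifecycleStatus is treated as Active.
    value = study.get("lifecycleStatus")
    return ACTIVE if value is None else value


def stage2_candidates(studies: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """PendingDedupCheck Studies plus Active Studies that lack a publicationId."""
    return [s for s in studies
            if _status(s) == PENDING_DEDUP_CHECK
            or (_status(s) == ACTIVE and s.get("publicationId") is None)]


def actionable_groups(groups: Iterable[List[Dict[str, Any]]]) -> List[List[Dict[str, Any]]]:
    """Keep only duplicate groups containing at least one PendingDedupCheck Study."""
    return [g for g in groups
            if any(_status(s) == PENDING_DEDUP_CHECK for s in g)]
```

Dropping groups made up solely of existing Active Studies keeps a living-search import from reopening duplicate decisions among records that were already in the project.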

Q3: Previous Studies Arm (Updated Reviews)

Context: PRISMA 2020 box 1 supports "studies from previous version of review." SyRF does not currently support updated reviews.

Recommendation: Defer. The StudyLifecycleStatus enum MAY be extended with PreviouslyIncluded in a future phase. No architectural changes required -- it is an additive extension.

Decision deferred to: Post-Release 3.

10. Cross-References

  • PRISMA Flow Diagram Mapping: prisma-flow-diagram-mapping.md -- Box-to-field mapping that references entities defined in this document.
  • ASySD Paper: Hair et al. (2023), BMC Biology 21, 189 -- Deduplication algorithm and canonical enrichment rules.
  • MongoDB CSUUID Configuration: MongoUtils.cs:23-37 -- GUID serialization format.
  • MongoContext Collection Naming: MongoContext.cs:158-167 -- pm prefix convention.

Requirement Coverage

| Requirement ID | Coverage in This Document |
|---|---|
| PRISMA-03 | Complete: Three-level model fully specified with entity definitions, field listings, relationships, behavioral rules |
| ARCH-07 | Complete: pmPublication collection specified with DOI/PMID indexes |
| DEDUP-01 | Referenced: Dedup service creates Citations and links to Publications |
| DEDUP-03 | Referenced: Import pipeline creates one Citation per imported record |
| DEDUP-04 | Referenced: StudyLifecycleStatus tracks duplicate status |
| DEDUP-05 | Referenced: Canonical enrichment rules for best-of-breed metadata |