Three-Level Data Model Specification: Publication / Citation / Study
Purpose
This document specifies the data model that separates bibliographic identity from import evidence from review data, enabling PRISMA-compliant counting and cross-project deduplication. It is a binding constraint on all data model decisions in Phases 3-16.
The three-level model directly addresses the PRISMA 2020 distinction between records, reports, and studies (see prisma-flow-diagram-mapping.md Section 2).
Normative language: "MUST" indicates an absolute requirement. "SHALL" indicates mandatory behavior. "SHOULD" indicates a strong recommendation. "MAY" indicates optional behavior.
1. Model Overview
      System Scope                        Project Scope
+----------------------+            +------------------------+
| pmPublication        |            | pmStudy                |
|                      |            |                        |
| _id                  |            | _id                    |
| doi (unique*)        | 1     many | projectId              |
| pmid (unique*)       |<-----+-----+ publicationId          |
| canonicalTitle       |      |     | lifecycleStatus        |
| canonicalAuthors     |      |     | fullTextStatus         |
| ...                  |      |     | duplicateGroupId       |
| metadataProvenance[] |      |     | screeningOutcomes[]    |
| linkedProjectIds[]   |      |     |                        |
+----------------------+      |     | citations[]:           |
                              |     | +--------------------+ |
                              |     | | Citation           | |
                              +-----+-+ publicationId      | |
                                    | | systematicSearchId | |
                                    | | sourceType         | |
                                    | | sourceName         | |
                                    | | rawTitle           | |
                                    | | rawAuthors         | |
                                    | | rawAbstract        | |
                                    | | rawDoi             | |
                                    | | ...                | |
                                    | | importedAt         | |
                                    | +--------------------+ |
                                    +------------------------+
Relationship Summary
| Relationship | Cardinality | Scope Crossing | Description |
|---|---|---|---|
| Publication ← Citation | 1 : many | System ← Project | Many Citations (across many projects) link to one Publication |
| Study → Publication | many : 1 | Project → System | Each Study has one canonical Publication |
| Study --contains→ Citation[] | 1 : many | Project-internal | A Study embeds all its Citations |
| Publication → Project[] | 1 : many | System → Project | A Publication tracks which projects reference it |
Key Invariants
- A `Publication` is system-scoped -- it exists independently of any project and accumulates metadata from all linked Citations across all projects.
- A `Citation` is project-scoped and immutable -- it preserves the exact citation as imported from a specific source, enabling per-source PRISMA counting.
- A `Study` is project-scoped and mutable -- it is the reviewable entity that annotators, screeners, and reconcilers interact with.
- Deleting a Study does NOT delete the Publication (system-scoped entities are never deleted by project operations).
- Multiple Citations may link to the same Study after deduplication confirms they represent the same research.
2. Publication Entity Specification
Collection
Name: pmPublication (new collection, system-scoped)
Prefix rationale: The pm prefix follows the existing Project Management bounded context convention (see MongoContext.GetCollection() at MongoContext.cs:158-167). Although Publications are system-scoped, they are managed within the Project Management domain.
Fields
| Field | Type | Nullable | Description |
|---|---|---|---|
| `_id` | Guid (CSUUID) | No | Primary identifier |
| `doi` | string | Yes | Digital Object Identifier (normalized to lowercase, without "https://doi.org/" prefix) |
| `pmid` | string | Yes | PubMed identifier |
| `canonicalTitle` | string | Yes | Best-of-breed title from all linked Citations |
| `canonicalAuthors` | Author[] | Yes | Best-of-breed author list (prefer most complete list) |
| `canonicalAbstract` | string | Yes | Best-of-breed abstract (prefer non-empty) |
| `canonicalYear` | int? | Yes | Best-of-breed publication year |
| `canonicalJournal` | string | Yes | Best-of-breed journal name (prefer full name over abbreviation) |
| `canonicalVolume` | string | Yes | Best-of-breed volume |
| `canonicalPages` | string | Yes | Best-of-breed page range |
| `canonicalIssue` | string | Yes | Best-of-breed issue number |
| `canonicalIsbn` | string | Yes | Best-of-breed ISBN/ISSN |
| `metadataProvenance` | MetadataProvenance[] | Yes | Field-level tracking of which Citation provided each canonical value |
| `linkedProjectIds` | Guid[] | No | Projects that have Citations referencing this Publication (denormalized for query efficiency) |
| `createdAt` | DateTime | No | Timestamp of Publication creation |
| `updatedAt` | DateTime | No | Timestamp of last metadata update |
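The `doi` field is stored lowercase and without the resolver prefix. The following is a minimal sketch of that normalization, not the project's implementation. The helper name `normalize_doi` is hypothetical, and stripping `http://`, `dx.doi.org`, and `doi:` variants is an assumption, since the specification names only the "https://doi.org/" prefix.

```python
from typing import Optional

# Prefixes stripped before storage. Only "https://doi.org/" is named in the
# specification; the other entries are assumptions about common variants.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw_doi: Optional[str]) -> Optional[str]:
    """Return the lowercase DOI without a resolver prefix, or None if empty."""
    if raw_doi is None:
        return None
    doi = raw_doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi or None
```

The normalized value would be written to `Publication.doi` only; the Citation's `rawDoi` keeps the imported string unchanged.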
MetadataProvenance Embedded Document
| Field | Type | Description |
|---|---|---|
| `fieldName` | string | Name of the canonical field (e.g., "canonicalTitle", "canonicalAbstract") |
| `sourceCitationId` | Guid | Citation that provided this field value |
| `sourceProjectId` | Guid | Project that owns the source Citation |
| `updatedAt` | DateTime | When this field was last updated from this source |
Indexes
| Index | Fields | Type | Purpose |
|---|---|---|---|
| `ix_doi` | `doi` | Unique, sparse | Fast lookup by DOI; enforces DOI uniqueness across system |
| `ix_pmid` | `pmid` | Unique, sparse | Fast lookup by PMID; enforces PMID uniqueness across system |
| `ix_linkedProjectIds` | `linkedProjectIds` | Regular | Find all Publications used by a project |
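Sparse index note: The unique, sparse index type means the uniqueness constraint applies only to documents that contain the indexed field. This is critical because many citations lack DOIs or PMIDs. Note that a MongoDB sparse index still indexes a field stored as an explicit null, so a missing DOI or PMID must be omitted from the stored document rather than written as null; otherwise two such documents would violate the unique constraint.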
Behavioral Rules
- Creation: A Publication SHALL be created when a Citation is imported and no existing Publication matches by DOI, PMID, or the ASySD dedup algorithm.
- Update: A Publication SHALL be updated when a new Citation (from any project) provides better metadata for any canonical field. "Better" is defined by the canonical enrichment rules (see Section 6).
- Deletion: A Publication SHALL NEVER be deleted, even if all Citations referencing it are removed from all projects.
- Cross-project: Multiple projects MAY reference the same Publication. The `linkedProjectIds` array MUST be updated when a new project creates a Citation linking to this Publication.
- GUID representation: All Guid fields MUST use CSUUID (C# Legacy GUID format, BinData subtype 3) as configured in `MongoUtils.cs`.
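The creation and cross-project rules can be illustrated with a small in-memory sketch. `PublicationStore` and `resolve` are hypothetical names standing in for the `pmPublication` collection and its lookup logic. Only the exact DOI/PMID match is shown; the ASySD fuzzy match is deliberately left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Publication:
    id: uuid.UUID
    doi: Optional[str] = None
    pmid: Optional[str] = None
    linked_project_ids: List[uuid.UUID] = field(default_factory=list)


class PublicationStore:
    """Hypothetical in-memory stand-in for the pmPublication collection."""

    def __init__(self) -> None:
        self._by_doi: Dict[str, Publication] = {}
        self._by_pmid: Dict[str, Publication] = {}

    def resolve(self, doi: Optional[str], pmid: Optional[str],
                project_id: uuid.UUID) -> Publication:
        """Return the matching Publication, creating one if none matches."""
        # Exact-identifier match only: DOI first, then PMID.
        publication = None
        if doi:
            publication = self._by_doi.get(doi)
        if publication is None and pmid:
            publication = self._by_pmid.get(pmid)
        if publication is None:
            publication = Publication(id=uuid.uuid4(), doi=doi, pmid=pmid)
            if doi:
                self._by_doi[doi] = publication
            if pmid:
                self._by_pmid[pmid] = publication
        # Cross-project rule: record the referencing project exactly once.
        if project_id not in publication.linked_project_ids:
            publication.linked_project_ids.append(project_id)
        return publication
```

There is no delete method in the sketch, mirroring the rule that a Publication is never deleted.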
3. Citation Value Object Specification
Storage
Embedded on: Study document as citations[] array.
Design decision: Citations are embedded on the Study document rather than stored in a separate collection. This is recommended for MVP because:
- A Study's Citations are always accessed together with the Study.
- The expected cardinality is low (typically 1-5 Citations per Study after dedup).
- Embedding avoids cross-collection joins for the common read path.
- If Study documents approach MongoDB's 16MB document size limit, Citations can be extracted to a separate `pmCitation` collection in a future phase. The `publicationId` and `studyId` fields provide the join keys needed for this migration.
Fields
| Field | Type | Nullable | Description |
|---|---|---|---|
| `_id` | Guid (CSUUID) | No | Unique identifier for this citation |
| `publicationId` | Guid (CSUUID) | No | Links to pmPublication._id -- the system-scoped bibliographic identity |
| `projectId` | Guid (CSUUID) | No | Project that owns this citation |
| `systematicSearchId` | Guid (CSUUID) | No | Links to the SystematicSearch that imported this record |
| `sourceType` | SearchSourceType | No | Classification of the import source (Database, Register, Website, Organisation, CitationSearching, Other) |
| `sourceName` | string | No | Human-readable source name (e.g., "PubMed", "Embase", "ClinicalTrials.gov") |
| `rawTitle` | string | Yes | Title as imported (never modified) |
| `rawAuthors` | string | Yes | Authors as imported (never modified) |
| `rawAbstract` | string | Yes | Abstract as imported (never modified) |
| `rawYear` | string | Yes | Publication year as imported (never modified; string to preserve original format) |
| `rawDoi` | string | Yes | DOI as imported (never modified) |
| `rawJournal` | string | Yes | Journal name as imported (never modified) |
| `rawVolume` | string | Yes | Volume as imported (never modified) |
| `rawPages` | string | Yes | Pages as imported (never modified) |
| `rawIssue` | string | Yes | Issue number as imported (never modified) |
| `rawIsbn` | string | Yes | ISBN/ISSN as imported (never modified) |
| `importedAt` | DateTime | No | Timestamp of import operation |
| `referenceFileId` | Guid (CSUUID) | Yes | Links to specific file within the systematic search |
Behavioral Rules
- Immutability: A Citation SHALL NEVER be modified after creation. This is a PRISMA requirement: the original import data MUST be preserved to enable per-source counting (derivation rules for PRISMA boxes 2, 11 depend on this).
- Creation: One Citation SHALL be created per imported record per import operation. If the same record appears in multiple files within a search, each instance creates a separate Citation.
- Dedup linking: Multiple Citations MAY link to the same Study after deduplication confirms they represent the same research investigation.
- Source type: The `sourceType` field SHALL be populated from the `SystematicSearch.sourceType` that triggered the import. This enables PRISMA Column 1 vs. Column 2 assignment.
- Raw field prefix: All raw bibliographic fields are prefixed with `raw` to distinguish them from the canonical (best-of-breed) fields on the Publication entity. Raw fields SHALL NEVER be modified.
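One way to make the immutability rule hard to violate by accident is to model the Citation as a frozen value object. The sketch below uses a Python frozen dataclass with a subset of the fields. It illustrates the idea only and is not the C# type used by the system; `try_edit_title` is a hypothetical helper that exists just to demonstrate the failure mode.

```python
import uuid
from dataclasses import FrozenInstanceError, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Subset of the Citation fields; frozen so raw values cannot change."""
    id: uuid.UUID
    publication_id: uuid.UUID
    project_id: uuid.UUID
    systematic_search_id: uuid.UUID
    source_type: str   # copied from SystematicSearch.sourceType at import
    source_name: str
    imported_at: datetime
    raw_title: Optional[str] = None
    raw_doi: Optional[str] = None


def try_edit_title(citation: Citation, new_title: str) -> bool:
    """Attempt an in-place edit; return True only if it succeeded."""
    try:
        citation.raw_title = new_title  # rejected by the frozen dataclass
        return True
    except FrozenInstanceError:
        return False


example = Citation(
    id=uuid.uuid4(), publication_id=uuid.uuid4(), project_id=uuid.uuid4(),
    systematic_search_id=uuid.uuid4(), source_type="Database",
    source_name="PubMed", imported_at=datetime.now(timezone.utc),
    raw_title="Original title as imported",
)
```

Language-level immutability covers only in-process objects; the persistence layer would still need to avoid issuing updates against `citations[]` elements.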
4. Study Entity Modifications
Collection
Name: pmStudy (existing collection, project-scoped)
New Fields
The following fields SHALL be added to the existing Study document. All new fields MUST be nullable to ensure backward compatibility with existing documents.
| Field | Type | Nullable | Default | Description |
|---|---|---|---|---|
| `lifecycleStatus` | StudyLifecycleStatus | Yes | null (treated as Active) | Track study position in the review pipeline |
| `duplicateGroupId` | Guid? (CSUUID) | Yes | null | Links duplicate studies to a group for dedup tracking |
| `publicationId` | Guid? (CSUUID) | Yes | null | Links to the pmPublication this study represents |
| `citations` | Citation[] | Yes | null (treated as empty) | Embedded array of all citations linked to this study |
| `fullTextStatus` | FullTextStatus | Yes | null (treated as Pending) | Track full-text retrieval progress |
| `screeningOutcomes` | ScreeningOutcome[] | Yes | null (treated as empty) | Per-profile screening results (specified in Phase 15, placeholder here) |
| `metaAnalysisIncluded` | bool? | Yes | null | Whether the study is included in quantitative synthesis (meta-analysis) |
StudyLifecycleStatus Enum
StudyLifecycleStatus:
Active = 0 // Default: available for screening/review
Duplicate = 1 // Confirmed duplicate (auto or admin-confirmed)
PendingDuplicateReview = 2 // Probable duplicate awaiting admin review
FullTextSought = 3 // Full text retrieval attempted
FullTextNotRetrieved = 4 // Full text could not be obtained
Included = 5 // Final: included in review
Merged = 6 // Merged into another study (duplicate resolution)
RemovedByAutomation = 7 // Removed by automation tool (PRISMA box 3)
RemovedOther = 8 // Removed for other pre-screen reasons (PRISMA box 3)
PendingDedupCheck = 9 // Stage 1 exact-match found no match; awaiting Stage 2 fuzzy matching
Critical design decision: Screening exclusion is NOT a lifecycle status. Screening outcomes are per-profile on the Study (in screeningOutcomes[]). The lifecycle status tracks the study's position in the overall review pipeline, while screening outcomes track per-criteria decisions. This separation is essential because a study can be excluded under one screening profile and included under another in a multi-stage pipeline. See prisma-flow-diagram-mapping.md Section 3 for how lifecycle status maps to PRISMA boxes.
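The enum and its null-handling rule can be sketched as follows. The member values are taken from the listing above; the Python naming and the helper `effective_status` are illustrative assumptions. The enum has no "excluded" member, which reflects the decision to keep screening exclusion out of the lifecycle.

```python
from enum import IntEnum
from typing import Optional


class StudyLifecycleStatus(IntEnum):
    ACTIVE = 0
    DUPLICATE = 1
    PENDING_DUPLICATE_REVIEW = 2
    FULL_TEXT_SOUGHT = 3
    FULL_TEXT_NOT_RETRIEVED = 4
    INCLUDED = 5
    MERGED = 6
    REMOVED_BY_AUTOMATION = 7
    REMOVED_OTHER = 8
    PENDING_DEDUP_CHECK = 9


def effective_status(stored: Optional[int]) -> StudyLifecycleStatus:
    """Map a stored value to the enum, treating null as Active."""
    if stored is None:
        return StudyLifecycleStatus.ACTIVE
    return StudyLifecycleStatus(stored)


# A study excluded under one profile and included under another keeps a
# single lifecycle status; the per-profile decisions live in the outcomes.
study = {
    "lifecycleStatus": None,
    "screeningOutcomes": [
        {"profileId": "profile-a", "finalOutcome": "Excluded"},
        {"profileId": "profile-b", "finalOutcome": "Included"},
    ],
}
```

Here `effective_status(study["lifecycleStatus"])` yields `ACTIVE` even though one profile excluded the study.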
FullTextStatus Enum
FullTextStatus:
Pending = 0 // Not yet sought (default for studies that pass title/abstract screening)
Sought = 1 // Full text retrieval has been attempted
Retrieved = 2 // Full text has been obtained
NotRetrieved = 3 // Full text could not be obtained (PRISMA boxes 7, 13)
ScreeningOutcome Embedded Document (Placeholder)
This structure is fully specified in Phase 15. The placeholder here establishes the shape for forward compatibility:
| Field | Type | Description |
|---|---|---|
| `profileId` | Guid | Screening profile that produced this outcome |
| `stageId` | Guid | Stage where screening occurred |
| `finalOutcome` | FinalScreeningOutcomeValue | Included / Excluded / Conflict / Pending |
| `primaryExclusionReason` | string? | Structured reason (for PRISMA box 9/15 reporting) |
| `decidedAt` | DateTime | When the outcome was determined |
| `source` | ScreeningAuthoritySource | Reconciled / CandidateAgreement / Admin |
Note: The exact field names and types for ScreeningOutcome may be refined in Phase 15. This placeholder establishes that screening outcomes are per-profile, stored as an array on the Study document, and include structured exclusion reasons.
5. Relationship Rules
Publication ← Citation (System ← Project)
- One `Publication` can be linked from Citations in MANY projects.
- The `Publication.linkedProjectIds[]` array MUST be updated whenever a Citation in a new project links to this Publication.
- When a project is deleted, its Citations are removed but the Publication persists. The project's ID SHALL be removed from `linkedProjectIds[]`.
Study → Publication (Project → System)
- One Study has ONE canonical Publication (via `publicationId`).
- Multiple Studies (in different projects) MAY reference the same Publication.
- The canonical Publication provides the best-of-breed metadata for display.
- When a Study is created from a Citation, `publicationId` SHALL be set to the Citation's `publicationId`.
Study --contains→ Citation[] (Project-internal)
- One Study has MANY Citations (from the same project, possibly from different systematic searches).
- After dedup, multiple Citations that represent the same research are linked to the same Study.
- The Citation count per Study enables the "records identified vs. studies included" distinction in PRISMA.
- Citations are immutable; Study metadata is mutable (enriched from best-of-breed Publication data).
Cascading Rules
| Operation | Effect on Publication | Effect on Citations | Effect on Study |
|---|---|---|---|
| Delete Study | Remove projectId from linkedProjectIds (if no other Studies in project link to it) | Removed with Study (embedded) | Deleted |
| Delete Project | Remove projectId from all affected Publications' linkedProjectIds | Removed with Studies (embedded) | All project Studies deleted |
| Merge Studies (dedup) | No change (both already link to same Publication) | Citations from secondary Study moved to primary Study | Secondary Study set to Merged status |
| Import new citation | Create or link to existing Publication | Create new Citation on Study | Create new Study (or add Citation to existing Study if dedup match) |
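The "Delete Study" row carries a condition that is easy to miss: the project ID is removed from `linkedProjectIds` only when no other Study in the same project still links to the Publication. The sketch below shows that check over plain dictionaries. The function name and the dictionary-based storage are assumptions made for illustration.

```python
from typing import Dict


def delete_study(study_id: str, studies: Dict[str, dict],
                 publications: Dict[str, dict]) -> None:
    """Delete a Study and unlink its project from the Publication if unused."""
    # Embedded citations disappear together with the Study document.
    study = studies.pop(study_id)
    publication_id = study.get("publicationId")
    if publication_id is None:
        return
    project_id = study["projectId"]
    # Keep the link while any remaining Study in the same project uses it.
    still_linked = any(
        other["projectId"] == project_id
        and other.get("publicationId") == publication_id
        for other in studies.values()
    )
    if not still_linked:
        linked = publications[publication_id]["linkedProjectIds"]
        if project_id in linked:
            linked.remove(project_id)
    # The Publication itself is never deleted by a project operation.
```

Deleting a whole project would follow the same pattern, removing the project ID from every affected Publication after all project Studies are gone.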
6. Canonical Enrichment Rules
When multiple Citations link to the same Publication (across projects), the canonical metadata fields on the Publication SHALL be populated using best-of-breed selection:
| Field | Selection Rule | Rationale |
|---|---|---|
| `canonicalTitle` | Prefer longest non-empty title | Longer titles are typically more complete |
| `canonicalAuthors` | Prefer longest/most complete author list | More authors = more complete record |
| `canonicalAbstract` | Prefer non-empty abstract; among non-empty, prefer longest | Abstract presence is critical for screening |
| `canonicalYear` | Prefer explicit numeric year | Explicit year is more reliable than parsed |
| `canonicalJournal` | Prefer full journal name over abbreviation | Full name is more informative |
| `canonicalVolume` | Prefer non-empty | Any volume data is better than none |
| `canonicalPages` | Prefer complete page range (containing "-") | Complete range is more informative |
| `canonicalIssue` | Prefer non-empty | Any issue data is better than none |
| `canonicalIsbn` | Prefer non-empty | Any ISBN/ISSN is better than none |
Provenance tracking: When a canonical field is updated, a MetadataProvenance entry MUST be added or updated to record which Citation provided the value.
ASySD alignment: These rules align with the merge_citations = TRUE behavior in the ASySD R package (Hair et al., 2023).
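A minimal sketch of best-of-breed selection with provenance tracking, covering four of the fields above. It is not the ASySD implementation. The function names are hypothetical, and provenance is simplified to a dictionary keyed by field name rather than the `MetadataProvenance[]` array defined in Section 2.

```python
from typing import Callable, Dict, Optional, Tuple

Rule = Callable[[Optional[str], Optional[str]], Optional[str]]


def prefer_longest(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    """Keep the longer non-empty value (title and abstract rules)."""
    if not candidate:
        return current
    if not current or len(candidate) > len(current):
        return candidate
    return current


def prefer_non_empty(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    """Keep the existing value unless it is empty (volume, issue, ISBN rules)."""
    return current or candidate or None


def prefer_page_range(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    """Prefer a value containing "-" over a single page number."""
    if not candidate:
        return current
    if not current or ("-" in candidate and "-" not in current):
        return candidate
    return current


# canonical field -> (raw Citation field, selection rule)
RULES: Dict[str, Tuple[str, Rule]] = {
    "canonicalTitle": ("rawTitle", prefer_longest),
    "canonicalAbstract": ("rawAbstract", prefer_longest),
    "canonicalVolume": ("rawVolume", prefer_non_empty),
    "canonicalPages": ("rawPages", prefer_page_range),
}


def enrich(publication: dict, citation: dict) -> None:
    """Update canonical fields in place and record which Citation won."""
    provenance = publication.setdefault("metadataProvenance", {})
    for canonical, (raw, rule) in RULES.items():
        current = publication.get(canonical)
        chosen = rule(current, citation.get(raw))
        if chosen != current:
            publication[canonical] = chosen
            provenance[canonical] = citation["_id"]
```

Because each rule compares the current canonical value with one candidate, Citations can be applied one at a time as they are imported, without re-reading every linked Citation.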
7. Migration Strategy
The three-level model is introduced incrementally across three releases to minimize migration risk.
Phase 7 (Release 1): Forward-Compatible Schema
| Change | Entity | Migration Type | Rollback |
|---|---|---|---|
| Add nullable `sourceType` field | `SystematicSearch` | Additive | `$unset: { sourceType: "" }` |
| Add nullable `sourceName` field | `SystematicSearch` | Additive | `$unset: { sourceName: "" }` |
| Verify Study schema supports additive fields | `Study` | Validation only | N/A |
Validation: The Study document schema MUST accept the future fields (lifecycleStatus, citations, publicationId, duplicateGroupId, fullTextStatus, screeningOutcomes, metaAnalysisIncluded) without breaking existing code. Since MongoDB is schemaless, this means verifying that no code path assumes these fields do NOT exist.
Phase 12 (Release 3): Core Three-Level Model
| Change | Entity | Migration Type | Rollback |
|---|---|---|---|
| Create `pmPublication` collection with indexes | New collection | Additive | Drop collection |
| Add `citations[]` to Study | `Study` | Additive (nullable) | `$unset: { citations: "" }` |
| Add `publicationId` to Study | `Study` | Additive (nullable) | `$unset: { publicationId: "" }` |
| Add `lifecycleStatus` to Study | `Study` | Additive (nullable) | `$unset: { lifecycleStatus: "" }` |
| Add `duplicateGroupId` to Study | `Study` | Additive (nullable) | `$unset: { duplicateGroupId: "" }` |
| Add `fullTextStatus` to Study | `Study` | Additive (nullable) | `$unset: { fullTextStatus: "" }` |
| Backfill: Create Publication per unique DOI/PMID from existing Studies | `pmPublication` | Backfill | Delete created Publications |
| Backfill: Create Citation per existing Study from SystematicSearch data | `Study` | Backfill | `$unset: { citations: "" }` |
| Backfill: Set `publicationId` where Publication was created | `Study` | Backfill | `$unset: { publicationId: "" }` |
Phase 16 (Release 3): Full PRISMA Support
| Change | Entity | Migration Type | Rollback |
|---|---|---|---|
| Backfill `lifecycleStatus = Active` on all existing studies | `Study` | Backfill | `$unset: { lifecycleStatus: "" }` |
| Populate `sourceType` on SystematicSearches where determinable from `LibraryFileType` | `SystematicSearch` | Backfill | `$unset: { sourceType: "" }` |
| Add `screeningOutcomes[]` to Study | `Study` | Additive (nullable) | `$unset: { screeningOutcomes: "" }` |
| Add `metaAnalysisIncluded` to Study | `Study` | Additive (nullable) | `$unset: { metaAnalysisIncluded: "" }` |
Source type inference rules for backfill:
| `LibraryFileType` | Inferred `sourceType` | Confidence |
|---|---|---|
| `PubmedXml` | Database | High |
| `EndnoteXml` | Unknown (Endnote exports from any source) | Cannot infer |
| `TsvLibrary` | Unknown | Cannot infer |
| `CsvLibrary` | Unknown | Cannot infer |
| `LivingSearchJson` | Unknown (depends on configured source) | Cannot infer |
For ambiguous cases, sourceType SHALL remain null and an admin interface SHALL allow manual classification.
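The inference table reduces to a lookup with a single confident entry. A sketch, assuming the hypothetical helper name `infer_source_type` and string-valued enum names:

```python
from typing import Dict, Optional

# Only file types with a confident mapping are listed; every other
# LibraryFileType is left for manual classification by an admin.
INFERRED_SOURCE_TYPE: Dict[str, str] = {
    "PubmedXml": "Database",
}


def infer_source_type(library_file_type: str) -> Optional[str]:
    """Return the inferred sourceType, or None when it cannot be inferred."""
    return INFERRED_SOURCE_TYPE.get(library_file_type)
```

Returning `None` for the ambiguous types keeps `sourceType` null in the backfill, as the rule above requires.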
8. PRISMA Counting Implications
Each level of the model enables specific PRISMA counts. This section maps the entity structure to PRISMA reporting capabilities.
Citation Level -- "Records Identified"
Citations enable per-source-type counting because each preserves its sourceType and sourceName immutably:
| PRISMA Box | Count | Data Path |
|---|---|---|
| Box 2: Records from databases | `COUNT(citations) WHERE sourceType = Database` | `Study.citations[].sourceType` |
| Box 2: Records from registers | `COUNT(citations) WHERE sourceType = Register` | `Study.citations[].sourceType` |
| Box 11: Records from websites | `COUNT(citations) WHERE sourceType = Website` | `Study.citations[].sourceType` |
| Box 11: Records from organisations | `COUNT(citations) WHERE sourceType = Organisation` | `Study.citations[].sourceType` |
| Box 11: Records from citation searching | `COUNT(citations) WHERE sourceType = CitationSearching` | `Study.citations[].sourceType` |
Citation vs. Study Count -- "Duplicates Removed"
The difference between Citation count and unique Study count after dedup gives the duplicate removal count:
| PRISMA Box | Count | Derivation |
|---|---|---|
| Box 3: Duplicates removed | `SUM(importRecord count across project) - COUNT(unique Studies WHERE lifecycleStatus NOT IN (Duplicate, Merged))` | Citation preservation + lifecycle status |
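The per-source counts and the duplicates-removed derivation can be sketched over in-memory Study documents. This assumes that the "importRecord count" in the Box 3 formula equals the number of embedded Citations in the project, and that a null `lifecycleStatus` is treated as Active. Both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

NOT_UNIQUE = {"Duplicate", "Merged"}


def records_by_source_type(studies: Iterable[dict]) -> Counter:
    """Count embedded Citations per sourceType (PRISMA boxes 2 and 11)."""
    return Counter(
        citation["sourceType"]
        for study in studies
        for citation in (study.get("citations") or [])
    )


def duplicates_removed(studies: List[dict]) -> int:
    """Total Citations minus Studies that are neither Duplicate nor Merged."""
    total_citations = sum(len(s.get("citations") or []) for s in studies)
    unique_studies = sum(
        1 for s in studies
        if (s.get("lifecycleStatus") or "Active") not in NOT_UNIQUE
    )
    return total_citations - unique_studies


project_studies = [
    {"lifecycleStatus": None, "citations": [
        {"sourceType": "Database"}, {"sourceType": "Database"},
        {"sourceType": "Register"}]},
    {"lifecycleStatus": "Merged", "citations": []},
    {"lifecycleStatus": "Duplicate", "citations": [{"sourceType": "Website"}]},
    {"lifecycleStatus": "Active", "citations": [{"sourceType": "Website"}]},
]
```

For `project_studies`, five Citations map to two unique Studies, so `duplicates_removed` returns 3.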
Study.lifecycleStatus -- Terminal State Boxes
The lifecycle status directly populates PRISMA terminal state boxes:
| PRISMA Box | lifecycleStatus | Count |
|---|---|---|
| Box 3: Excluded by automation | `RemovedByAutomation` | `COUNT(studies WHERE lifecycleStatus = RemovedByAutomation)` |
| Box 3: Excluded other | `RemovedOther` | `COUNT(studies WHERE lifecycleStatus = RemovedOther)` |
| Box 7/13: Reports not retrieved | `FullTextNotRetrieved` | `COUNT(studies WHERE lifecycleStatus = FullTextNotRetrieved OR fullTextStatus = NotRetrieved)` |
| Box 10/16: Studies included | `Included` | `COUNT(studies WHERE lifecycleStatus = Included)` |
Study.screeningOutcomes[] -- Screening Boxes
Per-profile screening outcomes enable the screening-related PRISMA boxes:
| PRISMA Box | Data Path | Count |
|---|---|---|
| Box 4: Records screened | Stage pool membership | Studies that entered T/A screening pool |
| Box 5: Records excluded (T/A) | `screeningOutcomes[profileId=TA].finalOutcome = Excluded` | Per-profile exclusion count |
| Box 8: Reports assessed | Stage pool membership | Studies that entered FT screening pool |
| Box 9/15: Reports excluded with reasons | `screeningOutcomes[profileId=FT].primaryExclusionReason` | Grouped by reason |
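The "grouped by reason" count for boxes 9/15 can be sketched as a tally over one profile's outcomes. The field names follow the placeholder `ScreeningOutcome` shape in Section 4 and may change in Phase 15. The `"Unspecified"` bucket for a missing reason and the function name are assumptions.

```python
from collections import Counter
from typing import Iterable


def exclusion_reasons(studies: Iterable[dict], profile_id: str) -> Counter:
    """Tally primaryExclusionReason for Excluded outcomes of one profile."""
    reasons: Counter = Counter()
    for study in studies:
        for outcome in study.get("screeningOutcomes") or []:
            if (outcome["profileId"] == profile_id
                    and outcome["finalOutcome"] == "Excluded"):
                reason = outcome.get("primaryExclusionReason") or "Unspecified"
                reasons[reason] += 1
    return reasons
```

Summing the tally gives the per-profile exclusion count used for box 5 when the title/abstract profile ID is passed instead.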
9. Open Questions
Q1: Citation Storage Model (Embedded vs. Separate Collection)
Current recommendation: Embedded on Study (citations[] array).
Rationale: Low cardinality (1-5 per Study typically), always accessed with Study, avoids cross-collection joins.
Risk: If a Study has many Citations (e.g., a highly-cited paper imported from 50+ searches), the embedded array could contribute to document size approaching the 16MB MongoDB limit. However, this is an extreme edge case.
Migration path: If embedded storage proves insufficient:
1. Create pmCitation collection with studyId and publicationId indexes.
2. Move Citations from Study.citations[] to the new collection.
3. Update all queries to use the collection instead of the embedded array.
4. This is a non-breaking change (additive collection + removal of embedded field).
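Steps 1 and 2 of the migration path can be sketched as a pure transformation over Study documents. The function name is hypothetical, and the output list stands in for the future `pmCitation` collection. Adding `studyId` to each extracted Citation is what makes the later join possible.

```python
from typing import List


def extract_citations(studies: List[dict]) -> List[dict]:
    """Move embedded citations out of each Study, tagging them with studyId."""
    pm_citation: List[dict] = []
    for study in studies:
        # pop() removes the embedded field; null is treated as empty.
        for citation in study.pop("citations", None) or []:
            pm_citation.append({**citation, "studyId": study["_id"]})
    return pm_citation
```

Copying each Citation into a new dictionary leaves its raw fields untouched, so the immutability rule holds across the move.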
Q2: Living Search Incremental Dedup Behavior ✅ Resolved
Context: SyRF supports living searches that periodically add new records. When new records arrive, they MUST be deduplicated against existing project records.
Resolution: Two-stage pipeline:
- Stage 1 (synchronous): DOI/PMID exact match against `pmPublication` resolves the majority of well-sourced imports immediately. Unmatched records are created with `lifecycleStatus = PendingDedupCheck`.
- Stage 2 (asynchronous): The `DedupStage2Consumer` runs the full C# fuzzy matching pipeline (blocking, Jaro-Winkler scoring, classification, grouping) over all `PendingDedupCheck` studies plus existing `Active` studies that lack a `publicationId`. Only studies without an authoritative identifier need to be included: if the new record were a duplicate of an identified study, Stage 1 would have caught it. After grouping, only groups containing at least one `PendingDedupCheck` record are acted on.
Resolved in: deduplication-service-specification.md Section 3.5.
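The selection logic around the two stages can be sketched without the fuzzy matcher itself. The three function names are hypothetical, the known DOI and PMID sets stand in for lookups against `pmPublication`, and a null `lifecycleStatus` is treated as Active.

```python
from typing import List, Optional, Set


def stage1_matches(doi: Optional[str], pmid: Optional[str],
                   known_dois: Set[str], known_pmids: Set[str]) -> bool:
    """Stage 1: exact DOI/PMID match against known Publications."""
    return bool((doi and doi in known_dois) or (pmid and pmid in known_pmids))


def stage2_candidates(studies: List[dict]) -> List[dict]:
    """Stage 2 input: pending studies plus Active studies without a Publication."""
    selected = []
    for study in studies:
        status = study.get("lifecycleStatus") or "Active"
        if status == "PendingDedupCheck":
            selected.append(study)
        elif status == "Active" and study.get("publicationId") is None:
            selected.append(study)
    return selected


def actionable_groups(groups: List[List[dict]]) -> List[List[dict]]:
    """Keep only groups that contain at least one PendingDedupCheck study."""
    return [
        group for group in groups
        if any(s.get("lifecycleStatus") == "PendingDedupCheck" for s in group)
    ]
```

Filtering groups after matching means that pairs of long-standing Active studies are never merged as a side effect of an unrelated import.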
Q3: Previous Studies Arm (Updated Reviews)
Context: PRISMA 2020 box 1 supports "studies from previous version of review." SyRF does not currently support updated reviews.
Recommendation: Defer. The StudyLifecycleStatus enum MAY be extended with PreviouslyIncluded in a future phase. No architectural changes required -- it is an additive extension.
Decision deferred to: Post-Release 3.
10. Cross-References
- PRISMA Flow Diagram Mapping: prisma-flow-diagram-mapping.md -- Box-to-field mapping that references entities defined in this document.
- ASySD Paper: Hair et al. (2023), BMC Biology 21, 189 -- Deduplication algorithm and canonical enrichment rules.
- MongoDB CSUUID Configuration: `MongoUtils.cs:23-37` -- GUID serialization format.
- MongoContext Collection Naming: `MongoContext.cs:158-167` -- `pm` prefix convention.
Requirement Coverage
| Requirement ID | Coverage in This Document |
|---|---|
| PRISMA-03 | Complete: Three-level model fully specified with entity definitions, field listings, relationships, behavioral rules |
| ARCH-07 | Complete: pmPublication collection specified with DOI/PMID indexes |
| DEDUP-01 | Referenced: Dedup service creates Citations and links to Publications |
| DEDUP-03 | Referenced: Import pipeline creates Citation per citation |
| DEDUP-04 | Referenced: StudyLifecycleStatus tracks duplicate status |
| DEDUP-05 | Referenced: Canonical enrichment rules for best-of-breed metadata |