Study

Purpose

This document specifies the data model that separates bibliographic identity (Publication), import evidence (Citation), and review data (Study), enabling PRISMA-compliant counting and cross-project deduplication. It is a binding constraint on all data model decisions in Phases 3-16.

The three-level model directly addresses the PRISMA 2020 distinction between records, reports, and studies (see prisma-flow-diagram-mapping.md Section 2).

Normative language: "MUST" indicates an absolute requirement. "SHALL" indicates mandatory behavior. "SHOULD" indicates a strong recommendation. "MAY" indicates optional behavior.

1. Model Overview

System Scope                              Project Scope
+----------------------+                  +------------------------+
|    pmPublication     |                  |        pmStudy         |
|                      |                  |                        |
| _id                  |                  | _id                    |
| doi (unique*)        | 1          many  | projectId              |
| pmid (unique*)       |<---------+-------+ publicationId          |
| canonicalTitle       |          |       | lifecycleStatus        |
| canonicalAuthors     |          |       | fullTextStatus         |
| ...                  |          |       | duplicateGroupId       |
| metadataProvenance[] |          |       | screeningOutcomes[]    |
| linkedProjectIds[]   |          |       |                        |
+----------------------+          |       | citations[]:           |
                                  |       | +--------------------+ |
                                  |       | | Citation           | |
                                  +-------+-+ publicationId      | |
                                          | | systematicSearchId | |
                                          | | sourceType         | |
                                          | | sourceName         | |
                                          | | rawTitle           | |
                                          | | rawAuthors         | |
                                          | | rawAbstract        | |
                                          | | rawDoi             | |
                                          | | ...                | |
                                          | | importedAt         | |
                                          | +--------------------+ |
                                          +------------------------+

Relationship Summary

Relationship	Cardinality	Scope Crossing	Description
Publication ← Citation	1 : many	System ← Project	Many Citations (across many projects) link to one Publication
Study → Publication	many : 1	Project → System	Each Study has one canonical Publication
Study --contains→ Citation[]	1 : many	Project-internal	A Study embeds all its Citations
Publication → Project[]	1 : many	System → Project	A Publication tracks which projects reference it

Key Invariants

A Publication is system-scoped -- it exists independently of any project and accumulates metadata from all linked Citations across all projects.
A Citation is project-scoped and immutable -- it preserves the exact record as imported from a specific source, enabling per-source PRISMA counting.
A Study is project-scoped and mutable -- it is the reviewable entity that annotators, screeners, and reconcilers interact with.
Deleting a Study does NOT delete the Publication (system-scoped entities are never deleted by project operations).
Multiple Citations may link to the same Study after deduplication confirms they represent the same research.

2. Publication Entity Specification

Collection

Name: pmPublication (new collection, system-scoped)

Prefix rationale: The pm prefix follows the existing Project Management bounded context convention (see MongoContext.GetCollection() at MongoContext.cs:158-167). Although Publications are system-scoped, they are managed within the Project Management domain.

Fields

Field	Type	Nullable	Description
`_id`	Guid (CSUUID)	No	Primary identifier
`doi`	string	Yes	Digital Object Identifier (normalized to lowercase, without "https://doi.org/" prefix)
`pmid`	string	Yes	PubMed identifier
`canonicalTitle`	string	Yes	Best-of-breed title from all linked Citations
`canonicalAuthors`	Author[]	Yes	Best-of-breed author list (prefer most complete list)
`canonicalAbstract`	string	Yes	Best-of-breed abstract (prefer non-empty)
`canonicalYear`	int?	Yes	Best-of-breed publication year
`canonicalJournal`	string	Yes	Best-of-breed journal name (prefer full name over abbreviation)
`canonicalVolume`	string	Yes	Best-of-breed volume
`canonicalPages`	string	Yes	Best-of-breed page range
`canonicalIssue`	string	Yes	Best-of-breed issue number
`canonicalIsbn`	string	Yes	Best-of-breed ISBN/ISSN
`metadataProvenance`	MetadataProvenance[]	Yes	Field-level tracking of which Citation provided each canonical value
`linkedProjectIds`	Guid[]	No	Projects that have Citations referencing this Publication (denormalized for query efficiency)
`createdAt`	DateTime	No	Timestamp of Publication creation
`updatedAt`	DateTime	No	Timestamp of last metadata update
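
The `doi` normalization described in the table could look roughly like the sketch below. It is written in Python purely for illustration (the production code is C#), and `normalize_doi` is a hypothetical helper name. Only the "https://doi.org/" prefix is named in this specification; the other prefixes handled here are assumptions about common import variants.

```python
from typing import Optional

# Prefixes stripped before storage. Only "https://doi.org/" is named in the
# specification; the remaining entries are assumptions for illustration.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw_doi: Optional[str]) -> Optional[str]:
    """Return the lowercase DOI without a resolver prefix, or None if empty."""
    if raw_doi is None:
        return None
    doi = raw_doi.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # An empty string carries no identity, so it is reported as "no DOI".
    return doi or None
```

Normalizing before lookup means that two imports of the same DOI in different casing or with different prefixes resolve to the same Publication.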

MetadataProvenance Embedded Document

Field	Type	Description
`fieldName`	string	Name of the canonical field (e.g., "canonicalTitle", "canonicalAbstract")
`sourceCitationId`	Guid	Citation that provided this field value
`sourceProjectId`	Guid	Project that owns the source Citation
`updatedAt`	DateTime	When this field was last updated from this source

Indexes

Index	Fields	Type	Purpose
`ix_doi`	`doi`	Unique, sparse	Fast lookup by DOI; enforces DOI uniqueness across system
`ix_pmid`	`pmid`	Unique, sparse	Fast lookup by PMID; enforces PMID uniqueness across system
`ix_linkedProjectIds`	`linkedProjectIds`	Regular	Find all Publications used by a project

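Sparse index note: A unique, sparse index enforces uniqueness only among documents in which the field is present. MongoDB still indexes a field that is present with an explicit null value, so a missing DOI or PMID must be omitted from the stored document rather than written as null (or the index must be defined as a partial index over string values); otherwise a second Publication without that identifier would violate the unique constraint. This is critical because many citations lack DOIs or PMIDs.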
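
As a rough sketch, the three indexes in the table could be declared as follows. The example uses PyMongo's `create_index` keyword options (`name`, `unique`, `sparse`) for illustration only; the production system uses the C# driver, and `ensure_publication_indexes` is a hypothetical helper name.

```python
from typing import Any


def ensure_publication_indexes(publications: Any) -> None:
    """Create the three pmPublication indexes listed in the table above.

    `publications` is expected to behave like a pymongo Collection.
    """
    # Unique + sparse: documents that omit the field are left out of the index.
    publications.create_index("doi", name="ix_doi", unique=True, sparse=True)
    publications.create_index("pmid", name="ix_pmid", unique=True, sparse=True)
    # Regular (multikey) index over the array of linked project ids.
    publications.create_index("linkedProjectIds", name="ix_linkedProjectIds")
```

Because the function only relies on `create_index`, it can be exercised against a stub object without a running database.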

Behavioral Rules

Creation: A Publication SHALL be created when a Citation is imported and no existing Publication matches by DOI, by PMID, or via the ASySD deduplication algorithm.
Update: A Publication SHALL be updated when a new Citation (from any project) provides better metadata for any canonical field. "Better" is defined by the canonical enrichment rules (see Section 6).
Deletion: A Publication SHALL NEVER be deleted, even if all Citations referencing it are removed from all projects.
Cross-project: Multiple projects MAY reference the same Publication. The linkedProjectIds array MUST be updated when a new project creates a Citation linking to this Publication.
GUID representation: All Guid fields MUST use CSUUID (C# Legacy GUID format, BinData subtype 3) as configured in MongoUtils.cs.
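
The creation and cross-project rules above can be sketched with an in-memory stand-in for the collection. This is an illustrative Python sketch, not the C# implementation: `PublicationStore` and its methods are hypothetical names, matching is limited to exact DOI/PMID equality (the ASySD fuzzy match is out of scope), and the store deliberately offers no delete operation.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class Publication:
    id: uuid.UUID
    doi: Optional[str] = None
    pmid: Optional[str] = None
    linked_project_ids: List[uuid.UUID] = field(default_factory=list)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class PublicationStore:
    """In-memory stand-in for the pmPublication collection (illustrative only)."""

    def __init__(self) -> None:
        self._by_id: Dict[uuid.UUID, Publication] = {}

    def find_match(self, doi: Optional[str], pmid: Optional[str]) -> Optional[Publication]:
        # Exact identifier match only; fuzzy matching is not modeled here.
        for pub in self._by_id.values():
            if doi and pub.doi == doi:
                return pub
            if pmid and pub.pmid == pmid:
                return pub
        return None

    def find_or_create(self, project_id: uuid.UUID,
                       doi: Optional[str] = None,
                       pmid: Optional[str] = None) -> Publication:
        pub = self.find_match(doi, pmid)
        if pub is None:
            # Creation rule: no existing Publication matched.
            pub = Publication(id=uuid.uuid4(), doi=doi, pmid=pmid)
            self._by_id[pub.id] = pub
        # Cross-project rule: record each referencing project exactly once.
        if project_id not in pub.linked_project_ids:
            pub.linked_project_ids.append(project_id)
        return pub
```

Two projects importing the same DOI end up sharing one Publication whose `linked_project_ids` lists both projects.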

3. Citation Value Object Specification

Storage

Embedded on: Study document as citations[] array.

Design decision: Citations are embedded on the Study document rather than stored in a separate collection. This is recommended for MVP because:

A Study's Citations are always accessed together with the Study.
The expected cardinality is low (typically 1-5 Citations per Study after dedup).
Embedding avoids cross-collection joins for the common read path.
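If Study documents approach MongoDB's 16 MB document size limit, Citations can be extracted to a separate pmCitation collection in a future phase. The publicationId field already present on each Citation, together with a studyId (implicit while Citations are embedded, and added on extraction), provides the join keys needed for this migration.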

Fields

Field	Type	Nullable	Description
`_id`	Guid (CSUUID)	No	Unique identifier for this citation
`publicationId`	Guid (CSUUID)	No	Links to `pmPublication._id` -- the system-scoped bibliographic identity
`projectId`	Guid (CSUUID)	No	Project that owns this citation
`systematicSearchId`	Guid (CSUUID)	No	Links to the `SystematicSearch` that imported this record
`sourceType`	SearchSourceType	No	Classification of the import source (Database, Register, Website, Organisation, CitationSearching, Other)
`sourceName`	string	No	Human-readable source name (e.g., "PubMed", "Embase", "ClinicalTrials.gov")
`rawTitle`	string	Yes	Title as imported (never modified)
`rawAuthors`	string	Yes	Authors as imported (never modified)
`rawAbstract`	string	Yes	Abstract as imported (never modified)
`rawYear`	string	Yes	Publication year as imported (never modified; string to preserve original format)
`rawDoi`	string	Yes	DOI as imported (never modified)
`rawJournal`	string	Yes	Journal name as imported (never modified)
`rawVolume`	string	Yes	Volume as imported (never modified)
`rawPages`	string	Yes	Pages as imported (never modified)
`rawIssue`	string	Yes	Issue number as imported (never modified)
`rawIsbn`	string	Yes	ISBN/ISSN as imported (never modified)
`importedAt`	DateTime	No	Timestamp of import operation
`referenceFileId`	Guid (CSUUID)	Yes	Links to specific file within the systematic search

Behavioral Rules

Immutability: A Citation SHALL NEVER be modified after creation. This is a PRISMA requirement: the original import data MUST be preserved to enable per-source counting (derivation rules for PRISMA boxes 2, 11 depend on this).
Creation: One Citation SHALL be created per imported record per import operation. If the same record appears in multiple files within a search, each instance creates a separate Citation.
Dedup linking: Multiple Citations MAY link to the same Study after deduplication confirms they represent the same research investigation.
Source type: The sourceType field SHALL be populated from the SystematicSearch.sourceType that triggered the import. This enables PRISMA Column 1 vs. Column 2 assignment.
Raw field prefix: All raw bibliographic fields are prefixed with raw to distinguish them from the canonical (best-of-breed) fields on the Publication entity. Raw fields SHALL NEVER be modified.
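
One way to make the immutability rule hard to violate in application code is to model the Citation as a frozen value type. The sketch below uses a Python frozen dataclass for illustration; it covers only a subset of the fields, and the snake_case names are an assumption of this example rather than the stored field names.

```python
import dataclasses
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Citation:
    """Subset of the Citation fields; frozen so raw values cannot be reassigned."""
    id: uuid.UUID
    publication_id: uuid.UUID
    project_id: uuid.UUID
    systematic_search_id: uuid.UUID
    source_type: str
    source_name: str
    imported_at: datetime
    raw_title: Optional[str] = None
    raw_doi: Optional[str] = None
    raw_year: Optional[str] = None  # kept as text to preserve the imported form


citation = Citation(
    id=uuid.uuid4(),
    publication_id=uuid.uuid4(),
    project_id=uuid.uuid4(),
    systematic_search_id=uuid.uuid4(),
    source_type="Database",
    source_name="PubMed",
    imported_at=datetime.now(timezone.utc),
    raw_title="Original title",
)

try:
    citation.raw_title = "Edited title"  # any reassignment is rejected
except dataclasses.FrozenInstanceError:
    pass
```

A frozen type only guards in-process assignment; the persistence layer would still need to avoid issuing updates against stored Citations.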

4. Study Entity Modifications

Collection

Name: pmStudy (existing collection, project-scoped)

New Fields

The following fields SHALL be added to the existing Study document. All new fields MUST be nullable to ensure backward compatibility with existing documents.

Field	Type	Nullable	Default	Description
`lifecycleStatus`	StudyLifecycleStatus	Yes	`null` (treated as Active)	Tracks the study's position in the review pipeline
`duplicateGroupId`	Guid? (CSUUID)	Yes	`null`	Links duplicate studies to a group for dedup tracking
`publicationId`	Guid? (CSUUID)	Yes	`null`	Links to the `pmPublication` this study represents
`citations`	Citation[]	Yes	`null` (treated as empty)	Embedded array of all citations linked to this study
`fullTextStatus`	FullTextStatus	Yes	`null` (treated as Pending)	Tracks full-text retrieval progress
`screeningOutcomes`	ScreeningOutcome[]	Yes	`null` (treated as empty)	Per-profile screening results (specified in Phase 15, placeholder here)
`metaAnalysisIncluded`	bool?	Yes	`null`	Whether the study is included in quantitative synthesis (meta-analysis)

StudyLifecycleStatus Enum

StudyLifecycleStatus:
  Active = 0                  // Default: available for screening/review
  Duplicate = 1               // Confirmed duplicate (auto or admin-confirmed)
  PendingDuplicateReview = 2  // Probable duplicate awaiting admin review
  FullTextSought = 3          // Full text retrieval attempted
  FullTextNotRetrieved = 4    // Full text could not be obtained
  Included = 5                // Final: included in review
  Merged = 6                  // Merged into another study (duplicate resolution)
  RemovedByAutomation = 7     // Removed by automation tool (PRISMA box 3)
  RemovedOther = 8            // Removed for other pre-screen reasons (PRISMA box 3)
  PendingDedupCheck = 9       // Stage 1 exact-match found no match; awaiting Stage 2 fuzzy matching

Critical design decision: Screening exclusion is NOT a lifecycle status. Screening outcomes are per-profile on the Study (in screeningOutcomes[]). The lifecycle status tracks the study's position in the overall review pipeline, while screening outcomes track per-criteria decisions. This separation is essential because a study can be excluded under one screening profile and included under another in a multi-stage pipeline. See prisma-flow-diagram-mapping.md Section 3 for how lifecycle status maps to PRISMA boxes.

FullTextStatus Enum

FullTextStatus:
  Pending = 0         // Not yet sought (default for studies that pass title/abstract screening)
  Sought = 1          // Full text retrieval has been attempted
  Retrieved = 2       // Full text has been obtained
  NotRetrieved = 3    // Full text could not be obtained (PRISMA boxes 7, 13)
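
Because both fields are nullable for backward compatibility, every reader has to apply the documented defaults (null lifecycleStatus means Active, null fullTextStatus means Pending). A minimal Python sketch of that read-side rule follows; the enum values mirror the listings above, while the `effective_*` helper names are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class StudyLifecycleStatus(IntEnum):
    ACTIVE = 0
    DUPLICATE = 1
    PENDING_DUPLICATE_REVIEW = 2
    FULL_TEXT_SOUGHT = 3
    FULL_TEXT_NOT_RETRIEVED = 4
    INCLUDED = 5
    MERGED = 6
    REMOVED_BY_AUTOMATION = 7
    REMOVED_OTHER = 8
    PENDING_DEDUP_CHECK = 9


class FullTextStatus(IntEnum):
    PENDING = 0
    SOUGHT = 1
    RETRIEVED = 2
    NOT_RETRIEVED = 3


def effective_lifecycle_status(stored: Optional[int]) -> StudyLifecycleStatus:
    """Documents written before the field existed store null: treat as Active."""
    return StudyLifecycleStatus.ACTIVE if stored is None else StudyLifecycleStatus(stored)


def effective_full_text_status(stored: Optional[int]) -> FullTextStatus:
    """A null fullTextStatus is treated as Pending."""
    return FullTextStatus.PENDING if stored is None else FullTextStatus(stored)
```

Centralizing the default in one helper keeps counting queries and UI code from disagreeing about what a missing value means.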

ScreeningOutcome Embedded Document (Placeholder)

This structure is fully specified in Phase 15. The placeholder here establishes the shape for forward compatibility:

Field	Type	Description
`profileId`	Guid	Screening profile that produced this outcome
`stageId`	Guid	Stage where screening occurred
`finalOutcome`	FinalScreeningOutcomeValue	Included / Excluded / Conflict / Pending
`primaryExclusionReason`	string?	Structured reason (for PRISMA box 9/15 reporting)
`decidedAt`	DateTime	When the outcome was determined
`source`	ScreeningAuthoritySource	Reconciled / CandidateAgreement / Admin

Note: The exact field names and types for ScreeningOutcome may be refined in Phase 15. This placeholder establishes that screening outcomes are per-profile, stored as an array on the Study document, and include structured exclusion reasons.

5. Relationship Rules

Publication ← Citation (System ← Project)

One Publication can be linked from Citations in MANY projects.
The Publication.linkedProjectIds[] array MUST be updated whenever a Citation in a new project links to this Publication.
When a project is deleted, its Citations are removed but the Publication persists. The project's ID SHALL be removed from linkedProjectIds[].

Study → Publication (Project → System)

One Study has ONE canonical Publication (via publicationId).
Multiple Studies (in different projects) MAY reference the same Publication.
The canonical Publication provides the best-of-breed metadata for display.
When a Study is created from a Citation, publicationId SHALL be set to the Citation's publicationId.

Study --contains→ Citation[] (Project-internal)

One Study has MANY Citations (from the same project, possibly from different systematic searches).
After dedup, multiple Citations that represent the same research are linked to the same Study.
The Citation count per Study enables the "records identified vs. studies included" distinction in PRISMA.
Citations are immutable; Study metadata is mutable (enriched from best-of-breed Publication data).

Cascading Rules

Operation	Effect on Publication	Effect on Citations	Effect on Study
Delete Study	Remove projectId from `linkedProjectIds` (if no other Studies in project link to it)	Removed with Study (embedded)	Deleted
Delete Project	Remove projectId from all affected Publications' `linkedProjectIds`	Removed with Studies (embedded)	All project Studies deleted
Merge Studies (dedup)	No change (both already link to same Publication)	Citations from secondary Study moved to primary Study	Secondary Study set to `Merged` status
Import new record	Create or link to existing Publication	Create new Citation on Study	Create new Study (or add Citation to existing Study if dedup match)
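
The "Merge Studies (dedup)" row can be sketched as below. This is an illustrative Python example with a simplified, hypothetical `Study` shape; treating "both already link to the same Publication" as a checked precondition is an assumption of the sketch, not a rule stated in this document.

```python
from dataclasses import dataclass, field
from typing import List

ACTIVE = "Active"
MERGED = "Merged"


@dataclass
class Study:
    id: str
    publication_id: str
    lifecycle_status: str = ACTIVE
    citations: List[dict] = field(default_factory=list)


def merge_studies(primary: Study, secondary: Study) -> None:
    """Move the secondary Study's Citations to the primary and mark it Merged."""
    if primary.publication_id != secondary.publication_id:
        # Assumed precondition: dedup has already linked both to one Publication.
        raise ValueError("studies must already link to the same Publication")
    # Citations are moved as-is, never edited, so per-source counts are preserved.
    primary.citations.extend(secondary.citations)
    secondary.citations = []
    secondary.lifecycle_status = MERGED
```

After a merge the total number of Citations in the project is unchanged, which is what keeps the "records identified" counts stable.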

6. Canonical Enrichment Rules

When multiple Citations link to the same Publication (across projects), the canonical metadata fields on the Publication SHALL be populated using best-of-breed selection:

Field	Selection Rule	Rationale
`canonicalTitle`	Prefer longest non-empty title	Longer titles are typically more complete
`canonicalAuthors`	Prefer longest/most complete author list	More authors = more complete record
`canonicalAbstract`	Prefer non-empty abstract; among non-empty, prefer longest	Abstract presence is critical for screening
`canonicalYear`	Prefer explicit numeric year	Explicit year is more reliable than parsed
`canonicalJournal`	Prefer full journal name over abbreviation	Full name is more informative
`canonicalVolume`	Prefer non-empty	Any volume data is better than none
`canonicalPages`	Prefer complete page range (containing "-")	Complete range is more informative
`canonicalIssue`	Prefer non-empty	Any issue data is better than none
`canonicalIsbn`	Prefer non-empty	Any ISBN/ISSN is better than none

Provenance tracking: When a canonical field is updated, a MetadataProvenance entry MUST be added or updated to record which Citation provided the value.

ASySD alignment: These rules align with the merge_citations = TRUE behavior in the ASySD R package (Hair et al., 2023).
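
A minimal sketch of best-of-breed selection with provenance tracking is shown below, covering only the string-valued rules (title, abstract, volume, issue, ISBN, pages). It is illustrative Python rather than the production C#. Several details are assumptions of the example: candidate values are keyed by canonical field name, ties keep the existing value, and `metadataProvenance` is modeled as a dict keyed by field name instead of the array defined in Section 2.

```python
from typing import Callable, Dict, List, Optional

Rule = Callable[[Optional[str], Optional[str]], Optional[str]]


def _non_empty(value: Optional[str]) -> bool:
    return bool(value and value.strip())


def prefer_longest(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    if not _non_empty(candidate):
        return current
    if not _non_empty(current):
        return candidate
    return candidate if len(candidate) > len(current) else current


def prefer_non_empty(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    if _non_empty(current) or not _non_empty(candidate):
        return current
    return candidate


def prefer_page_range(current: Optional[str], candidate: Optional[str]) -> Optional[str]:
    if not _non_empty(candidate):
        return current
    if not _non_empty(current):
        return candidate
    # A range such as "123-130" beats a single start page such as "123".
    return candidate if "-" in candidate and "-" not in current else current


RULES: Dict[str, Rule] = {
    "canonicalTitle": prefer_longest,
    "canonicalAbstract": prefer_longest,
    "canonicalVolume": prefer_non_empty,
    "canonicalIssue": prefer_non_empty,
    "canonicalIsbn": prefer_non_empty,
    "canonicalPages": prefer_page_range,
}


def enrich(publication: dict, candidate: dict, citation_id: str, project_id: str) -> List[str]:
    """Apply the rules in place and return the names of the fields that changed."""
    changed: List[str] = []
    provenance = publication.setdefault("metadataProvenance", {})
    for name, rule in RULES.items():
        current = publication.get(name)
        chosen = rule(current, candidate.get(name))
        if chosen != current:
            publication[name] = chosen
            # Record which Citation supplied the new canonical value.
            provenance[name] = {"sourceCitationId": citation_id,
                                "sourceProjectId": project_id}
            changed.append(name)
    return changed
```

Returning the list of changed fields gives the caller a cheap way to decide whether `updatedAt` needs to be touched.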

7. Migration Strategy

The three-level model is introduced incrementally in three phases (7, 12, and 16) to minimize migration risk.

Phase 7 (Release 1): Forward-Compatible Schema

Change	Entity	Migration Type	Rollback
Add nullable `sourceType` field	`SystematicSearch`	Additive	`$unset: { sourceType: "" }`
Add nullable `sourceName` field	`SystematicSearch`	Additive	`$unset: { sourceName: "" }`
Verify Study schema supports additive fields	`Study`	Validation only	N/A

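Validation: The Study document schema MUST accept the future fields (lifecycleStatus, citations, publicationId, duplicateGroupId, fullTextStatus, screeningOutcomes, metaAnalysisIncluded) without breaking existing code. MongoDB does not enforce a schema on the collection by default, so this means verifying the application side: no code path may assume these fields do NOT exist, and Study deserialization must tolerate fields it does not yet map (the MongoDB C# driver rejects unmapped elements unless the class is configured to ignore extra elements).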

Phase 12 (Release 3): Core Three-Level Model

Change	Entity	Migration Type	Rollback
Create `pmPublication` collection with indexes	New collection	Additive	Drop collection
Add `citations[]` to Study	`Study`	Additive (nullable)	`$unset: { citations: "" }`
Add `publicationId` to Study	`Study`	Additive (nullable)	`$unset: { publicationId: "" }`
Add `lifecycleStatus` to Study	`Study`	Additive (nullable)	`$unset: { lifecycleStatus: "" }`
Add `duplicateGroupId` to Study	`Study`	Additive (nullable)	`$unset: { duplicateGroupId: "" }`
Add `fullTextStatus` to Study	`Study`	Additive (nullable)	`$unset: { fullTextStatus: "" }`
Backfill: Create Publication per unique DOI/PMID from existing Studies	`pmPublication`	Backfill	Delete created Publications
Backfill: Create Citation per existing Study from `SystematicSearch` data	`Study`	Backfill	`$unset: { citations: "" }`
Backfill: Set `publicationId` where Publication was created	`Study`	Backfill	`$unset: { publicationId: "" }`

Phase 16 (Release 3): Full PRISMA Support

Change	Entity	Migration Type	Rollback
Backfill `lifecycleStatus = Active` on all existing studies	`Study`	Backfill	`$unset: { lifecycleStatus: "" }`
Populate `sourceType` on SystematicSearches where determinable from `LibraryFileType`	`SystematicSearch`	Backfill	`$unset: { sourceType: "" }`
Add `screeningOutcomes[]` to Study	`Study`	Additive (nullable)	`$unset: { screeningOutcomes: "" }`
Add `metaAnalysisIncluded` to Study	`Study`	Additive (nullable)	`$unset: { metaAnalysisIncluded: "" }`

Source type inference rules for backfill:

`LibraryFileType`	Inferred `sourceType`	Confidence
`PubmedXml`	`Database`	High
`EndnoteXml`	Unknown (EndNote exports can originate from any source)	Cannot infer
`TsvLibrary`	Unknown	Cannot infer
`CsvLibrary`	Unknown	Cannot infer
`LivingSearchJson`	Unknown (depends on configured source)	Cannot infer

For ambiguous cases, sourceType SHALL remain null and an admin interface SHALL allow manual classification.
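
The inference table reduces to a single confident mapping, which the following Python sketch makes explicit. `infer_source_type` is a hypothetical helper name; returning `None` stands for "leave sourceType null for manual classification".

```python
from typing import Optional

# Only PubmedXml can be classified with confidence. Every other file type
# stays unclassified (None) until an admin assigns a source type manually.
_INFERRED_SOURCE_TYPE = {"PubmedXml": "Database"}


def infer_source_type(library_file_type: str) -> Optional[str]:
    """Return the inferred sourceType, or None when it cannot be inferred."""
    return _INFERRED_SOURCE_TYPE.get(library_file_type)
```

Keeping the mapping as data makes it straightforward to add further entries if other file types later turn out to be classifiable.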

8. PRISMA Counting Implications

Each level of the model enables specific PRISMA counts. This section maps the entity structure to PRISMA reporting capabilities.

Citation Level -- "Records Identified"

Citations enable per-source-type counting because each preserves its sourceType and sourceName immutably:

PRISMA Box	Count	Data Path
Box 2: Records from databases	`COUNT(citations) WHERE sourceType = Database`	`Study.citations[].sourceType`
Box 2: Records from registers	`COUNT(citations) WHERE sourceType = Register`	`Study.citations[].sourceType`
Box 11: Records from websites	`COUNT(citations) WHERE sourceType = Website`	`Study.citations[].sourceType`
Box 11: Records from organisations	`COUNT(citations) WHERE sourceType = Organisation`	`Study.citations[].sourceType`
Box 11: Records from citation searching	`COUNT(citations) WHERE sourceType = CitationSearching`	`Study.citations[].sourceType`
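
The per-source counts in the table amount to tallying `sourceType` over every embedded Citation in a project. A small illustrative Python sketch, assuming Study documents are available as plain dictionaries (the function name is hypothetical):

```python
from collections import Counter
from typing import Iterable, Mapping


def count_records_by_source_type(studies: Iterable[Mapping]) -> Counter:
    """Tally Citations by sourceType across all Studies of one project."""
    counts: Counter = Counter()
    for study in studies:
        # A null citations array is treated as empty (see Section 4).
        for citation in study.get("citations") or []:
            counts[citation["sourceType"]] += 1
    return counts
```

Citations on Studies marked Duplicate or Merged are still counted here, since "records identified" refers to everything imported before deduplication.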

Citation vs. Study Count -- "Duplicates Removed"

The difference between Citation count and unique Study count after dedup gives the duplicate removal count:

PRISMA Box	Count	Derivation
Box 3: Duplicates removed	`COUNT(Citations across project) - COUNT(Studies WHERE lifecycleStatus NOT IN (Duplicate, Merged))`	Citation preservation + lifecycle status
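
As a worked example of this derivation, the sketch below subtracts the number of surviving Studies from the total number of imported records. It is illustrative Python over plain dictionaries; the function name is hypothetical, and a null lifecycleStatus is treated as Active as specified in Section 4.

```python
from typing import Mapping, Sequence

_NOT_SURVIVING = ("Duplicate", "Merged")


def duplicates_removed(studies: Sequence[Mapping]) -> int:
    """Total imported records minus Studies that survived deduplication."""
    total_records = sum(len(s.get("citations") or []) for s in studies)
    surviving = sum(
        1 for s in studies
        if (s.get("lifecycleStatus") or "Active") not in _NOT_SURVIVING
    )
    return total_records - surviving
```

For a project with five imported records spread over one Active Study holding three Citations, one Duplicate Study, one Merged Study, and one untouched Study, the function yields 5 - 2 = 3 duplicates removed.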

Study.lifecycleStatus -- Terminal State Boxes

The lifecycle status directly populates PRISMA terminal state boxes:

PRISMA Box	lifecycleStatus	Count
Box 3: Excluded by automation	`RemovedByAutomation`	`COUNT(studies WHERE lifecycleStatus = RemovedByAutomation)`
Box 3: Excluded other	`RemovedOther`	`COUNT(studies WHERE lifecycleStatus = RemovedOther)`
Box 7/13: Reports not retrieved	`FullTextNotRetrieved`	`COUNT(studies WHERE lifecycleStatus = FullTextNotRetrieved)` or `fullTextStatus = NotRetrieved`
Box 10/16: Studies included	`Included`	`COUNT(studies WHERE lifecycleStatus = Included)`

Study.screeningOutcomes[] -- Screening Boxes

Per-profile screening outcomes enable the screening-related PRISMA boxes:

PRISMA Box	Data Path	Count
Box 4: Records screened	Stage pool membership	Studies that entered T/A screening pool
Box 5: Records excluded (T/A)	`screeningOutcomes[profileId=TA].finalOutcome = Excluded`	Per-profile exclusion count
Box 8: Reports assessed	Stage pool membership	Studies that entered FT screening pool
Box 9/15: Reports excluded with reasons	`screeningOutcomes[profileId=FT].primaryExclusionReason`	Grouped by reason

9. Open Questions

Q1: Citation Storage Model (Embedded vs. Separate Collection)

Current recommendation: Embedded on Study (citations[] array).

Rationale: Low cardinality (1-5 per Study typically), always accessed with Study, avoids cross-collection joins.

Risk: If a Study accumulates many Citations (e.g., the same paper returned by 50+ searches), the embedded array adds to the size of the Study document, which MongoDB caps at 16 MB. However, this is an extreme edge case.

Migration path: If embedded storage proves insufficient:

1. Create pmCitation collection with studyId and publicationId indexes.
2. Move Citations from Study.citations[] to the new collection.
3. Update all queries to use the collection instead of the embedded array.

This is a non-breaking change (additive collection + removal of embedded field).

Q2: Living Search Incremental Dedup Behavior ✅ Resolved

Context: SyRF supports living searches that periodically add new records. When new records arrive, they MUST be deduplicated against existing project records.

Resolution: Two-stage pipeline:

Stage 1 (synchronous): DOI/PMID exact match against pmPublication resolves the majority of well-sourced imports immediately. Unmatched records are created with lifecycleStatus = PendingDedupCheck.
Stage 2 (asynchronous): The DedupStage2Consumer runs the full C# fuzzy matching pipeline (blocking, Jaro-Winkler scoring, classification, grouping) over all PendingDedupCheck studies plus existing Active studies that lack a publicationId. Only studies without an authoritative identifier need to be included — if the new record were a duplicate of an identified study, Stage 1 would have caught it. After grouping, only groups containing at least one PendingDedupCheck record are acted on.
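
The selection logic around the two stages can be sketched as follows. This illustrative Python example covers only the exact-match lookup, the Stage 2 candidate pool, and the filter on resulting groups; the fuzzy matching itself (blocking, Jaro-Winkler scoring, classification, grouping) is not reproduced. All function names and the dictionary-based lookups are assumptions of the sketch.

```python
from typing import Dict, List, Mapping, Optional, Sequence


def stage1_match(record: Mapping,
                 publication_by_doi: Dict[str, str],
                 publication_by_pmid: Dict[str, str]) -> Optional[str]:
    """Return the id of a Publication matching by exact DOI or PMID, else None."""
    doi = record.get("doi")
    if doi and doi in publication_by_doi:
        return publication_by_doi[doi]
    pmid = record.get("pmid")
    if pmid and pmid in publication_by_pmid:
        return publication_by_pmid[pmid]
    return None


def stage2_pool(studies: Sequence[Mapping]) -> List[Mapping]:
    """PendingDedupCheck studies plus Active studies that lack a publicationId."""
    pool = []
    for study in studies:
        status = study.get("lifecycleStatus") or "Active"  # null means Active
        if status == "PendingDedupCheck":
            pool.append(study)
        elif status == "Active" and study.get("publicationId") is None:
            pool.append(study)
    return pool


def actionable_groups(groups: Sequence[Sequence[Mapping]]) -> List[Sequence[Mapping]]:
    """Keep only groups that contain at least one PendingDedupCheck study."""
    return [
        group for group in groups
        if any(s.get("lifecycleStatus") == "PendingDedupCheck" for s in group)
    ]
```

Restricting action to groups with a pending member means that pairs of long-standing Active studies are not silently regrouped as a side effect of a new import.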

Resolved in: deduplication-service-specification.md Section 3.5.

Q3: Previous Studies Arm (Updated Reviews)

Context: PRISMA 2020 box 1 supports "studies from previous version of review." SyRF does not currently support updated reviews.

Recommendation: Defer. The StudyLifecycleStatus enum MAY be extended with PreviouslyIncluded in a future phase. No architectural changes required -- it is an additive extension.

Decision deferred to: Post-Release 3.

10. Cross-References

PRISMA Flow Diagram Mapping: prisma-flow-diagram-mapping.md -- Box-to-field mapping that references entities defined in this document.
ASySD Paper: Hair et al. (2023), BMC Biology 21, 189 -- Deduplication algorithm and canonical enrichment rules.
MongoDB CSUUID Configuration: MongoUtils.cs:23-37 -- GUID serialization format.
MongoContext Collection Naming: MongoContext.cs:158-167 -- pm prefix convention.

Requirement Coverage

Requirement ID	Coverage in This Document
PRISMA-03	Complete: Three-level model fully specified with entity definitions, field listings, relationships, behavioral rules
ARCH-07	Complete: `pmPublication` collection specified with DOI/PMID indexes
DEDUP-01	Referenced: Dedup service creates Citations and links to Publications
DEDUP-03	Referenced: Import pipeline creates one Citation per imported record
DEDUP-04	Referenced: StudyLifecycleStatus tracks duplicate status
DEDUP-05	Referenced: Canonical enrichment rules for best-of-breed metadata