Shared: Data Model & Migration Plan
Summary
This document describes the schema enrichments to existing MongoDB aggregates required to support versioned questions, stage-level question configuration, and reconciliation records. It also details the migration strategy for existing data.
Parent: Annotation Management & Reconciliation
⚠️ Precedence note: This document predates the Design Decisions (D1–D50) and the Annotation Versioning Design (D37–D50). Where this document contradicts those authoritative references, the design decisions take precedence. In particular:
- The Reconciliation Records schema (embedded ReconciliationRecord in Study.ExtractionInfo) has been superseded by the standalone Reconciliation Session model (D5). The migration approach and tooling remain valid; the target schema needs updating.
- The embedded-first, extract-if-needed design principle, migration strategy (additive fields, rollback-by-unset), monitoring queries, and environment strategy remain current and applicable to Phase 7.
This document was merged from the closed PR #2325 branch to preserve implementation detail not carried forward during spec restructuring. It should be reconciled with the design decisions before Phase 7 execution.
Design Principles
Embedded-First, Extract-If-Needed
After design review, we chose to keep all data embedded in existing aggregates rather than creating new collections. The rationale:
| Principle | Embedded | Separate Collections |
|---|---|---|
| Atomic consistency | Single-document writes, no multi-doc transactions needed | Requires multi-document transactions or eventual consistency |
| Read performance | Loading a project/study gives everything needed | Requires joins ($lookup) or multiple queries |
| Complexity | No new collections, repositories, or indexes to manage | 5 new collections, new repository implementations, new index strategies |
| Data volume | Questions: small/bounded. Annotations: within 16MB per study | Only annotations could theoretically grow large, but current data is well within limits |
| Audit trail | Version arrays are more readable than cross-collection references | Opaque snapshot hashes require resolution to compare |
What we're NOT doing (and why):
| Original Proposal | Decision | Rationale |
|---|---|---|
| pmQuestionVersion collection | Embedded _versionHistory on AnnotationQuestion | Small, bounded, project-scoped, atomic with question edits |
| pmQuestionSet collection | Embedded StageQuestionConfig on Stage | Thin reference list, tightly coupled to stage lifecycle |
| pmStageAssignment collection | Embedded StageQuestionConfig with history on Stage | 1-5 configs per stage, no cross-project queries needed |
| pmAnnotation collection | Keep in Study.ExtractionInfo.Annotations | Atomic consistency with study, working today, no evidence of size pressure |
| pmReconciliation collection | Embedded ReconciliationRecords in ExtractionInfo | New feature, bounded records, atomic with annotations and sessions |
Schema Changes Overview
No New Collections
All changes are embedded enrichments to existing pmProject and pmStudy documents.
Modified Aggregates
| Aggregate | Collection | Change |
|---|---|---|
| Project.AnnotationQuestion | pmProject | Gains CurrentVersionNumber, _versionHistory |
| Project.Stage | pmProject | Gains QuestionConfig, QuestionConfigHistory (replaces AnnotationQuestions HashSet) |
| Study.Annotation | pmStudy | Gains QuestionVersionNumber, StageConfigVersion |
| Study.ExtractionInfo | pmStudy | Gains ReconciliationRecords list |
Detailed Schema Designs
AnnotationQuestion — Version History
Each AnnotationQuestion gains a version history. The question's current properties ARE the latest version; previous states are stored as immutable snapshots.
This follows the existing v0/v1 schema pattern already used by AnnotationQuestion (backing fields like _v0Options/_v1Options in AnnotationQuestion.cs).
New C# types:
// Immutable snapshot of a question at a point in time
public class QuestionVersionSnapshot
{
    public int VersionNumber { get; init; }
    public string Question { get; init; }
    public string QuestionType { get; init; }
    public string ControlType { get; init; }
    public string Category { get; init; }
    public bool Optional { get; init; }
    public bool Multiple { get; init; }
    public bool AnswerArray { get; init; }
    public ImmutableList<QuestionOption>? Options { get; init; }
    public Target? Target { get; init; }
    public List<Guid> SubquestionIds { get; init; }
    public string? Description { get; init; }
    public DateTime CreatedAt { get; init; }
    public Guid CreatedBy { get; init; }
}
New properties on AnnotationQuestion:
public int CurrentVersionNumber { get; private set; } = 1;
private List<QuestionVersionSnapshot> _versionHistory = new();
public IReadOnlyList<QuestionVersionSnapshot> VersionHistory => _versionHistory.AsReadOnly();
MongoDB representation (embedded within pmProject._annotationQuestions[]):
{
  _id: CSUUID("..."),
  Question: "What was the sample size?",
  QuestionType: "integer",
  ControlType: "textbox",
  // ... all existing fields unchanged ...
  CurrentVersionNumber: 2,
  _versionHistory: [
    {
      VersionNumber: 1,
      Question: "Sample size?", // Original text before edit
      QuestionType: "integer",
      ControlType: "textbox",
      Category: "Study",
      Optional: false,
      Multiple: false,
      AnswerArray: false,
      Options: null,
      Target: null,
      SubquestionIds: [],
      Description: "Number of subjects",
      CreatedAt: ISODate("2026-01-15T10:00:00Z"),
      CreatedBy: CSUUID("...")
    }
  ]
}
Behaviour: When AnnotationQuestion.Update() is called, the current state is snapshotted into _versionHistory before applying the edit, then CurrentVersionNumber is incremented.
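The snapshot-then-increment behaviour can be illustrated with a small JavaScript sketch. It is not the C# implementation: updateQuestion and the plain-object shape are hypothetical, and only two of the snapshot fields are modelled (CreatedAt and CreatedBy are left out because the document does not say how they are populated).

```javascript
// Hypothetical helper: snapshot the current state, then apply the edit.
// Only Question and Description are modelled; the real snapshot carries more fields.
function updateQuestion(question, changes) {
  const snapshot = Object.freeze({
    VersionNumber: question.CurrentVersionNumber,
    Question: question.Question,
    Description: question.Description,
  });
  return {
    ...question,
    ...changes,
    // Set after the spread so an edit cannot overwrite the version bookkeeping.
    CurrentVersionNumber: question.CurrentVersionNumber + 1,
    _versionHistory: [...question._versionHistory, snapshot],
  };
}

const questionV1 = {
  Question: "Sample size?",
  Description: "Number of subjects",
  CurrentVersionNumber: 1,
  _versionHistory: [],
};
const questionV2 = updateQuestion(questionV1, { Question: "What was the sample size?" });
console.log(questionV2.CurrentVersionNumber, questionV2._versionHistory[0].Question);
// prints: 2 Sample size?
```

The history only ever holds versions older than the current one, which matches the representation above: CurrentVersionNumber is 2 and the array contains a single entry for version 1.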
Stage — Question Configuration
Replace the flat HashSet<Guid> AnnotationQuestions with a versioned configuration that tracks which question version is assigned at each position.
New C# types:
public class StageQuestionEntry
{
    public Guid QuestionId { get; init; }
    public int VersionNumber { get; init; }
    public int Position { get; init; }
}

public class StageQuestionConfig
{
    public int Version { get; init; } // Config version (increments on change)
    public List<StageQuestionEntry> Questions { get; init; }
    public DateTime EffectiveFrom { get; init; }
    public Guid? UpdatedBy { get; init; }
}
New properties on Stage:
public StageQuestionConfig QuestionConfig { get; private set; }
public List<StageQuestionConfig> QuestionConfigHistory { get; private set; } = new();
Backward compatibility: The existing AnnotationQuestions HashSet becomes a computed property:
public HashSet<Guid> AnnotationQuestions =>
    new(QuestionConfig?.Questions.Select(q => q.QuestionId) ?? Enumerable.Empty<Guid>());
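A JavaScript mirror of the computed property makes its behaviour on unmigrated data easy to see. This is a sketch only: annotationQuestionIds is a hypothetical name and the string IDs stand in for GUIDs.

```javascript
// Hypothetical mirror of the computed property: derive the legacy ID set from QuestionConfig.
function annotationQuestionIds(stage) {
  const entries = stage.QuestionConfig?.Questions ?? [];
  return new Set(entries.map((q) => q.QuestionId));
}

const migratedStage = {
  QuestionConfig: {
    Version: 1,
    Questions: [
      { QuestionId: "q-a", VersionNumber: 2, Position: 0 },
      { QuestionId: "q-b", VersionNumber: 1, Position: 1 },
    ],
  },
};
const unmigratedStage = {}; // a stage persisted before QuestionConfig existed

console.log([...annotationQuestionIds(migratedStage)]);
console.log(annotationQuestionIds(unmigratedStage).size);
```

A stage without a QuestionConfig yields an empty set, so the stage config backfill has to complete before code that relies on the computed property is deployed.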
MongoDB representation (embedded within pmProject.Stages[]):
{
  _id: CSUUID("..."),
  Name: "Screening",
  Active: true,
  Extraction: true,
  // ... existing fields ...
  QuestionConfig: {
    Version: 2,
    Questions: [
      { QuestionId: CSUUID("..."), VersionNumber: 2, Position: 0 },
      { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 1 },
      { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 2 }
    ],
    EffectiveFrom: ISODate("2026-03-01T00:00:00Z"),
    UpdatedBy: CSUUID("...")
  },
  QuestionConfigHistory: [
    {
      Version: 1,
      Questions: [
        { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 0 },
        { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 1 }
      ],
      EffectiveFrom: ISODate("2026-01-01T00:00:00Z"),
      UpdatedBy: CSUUID("...")
    }
  ]
}
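The representation above implies that changing a stage's questions archives the active config and installs a new one with the next version number. A minimal JavaScript sketch of that transition follows, assuming that reading is correct; replaceQuestionConfig is a hypothetical helper, not an existing method on Stage.

```javascript
// Hypothetical helper: archive the active config and install a new one.
// IDs are short strings here; the stored documents use CSUUID values.
function replaceQuestionConfig(stage, entries, updatedBy, effectiveFrom) {
  const previous = stage.QuestionConfig ?? null;
  const history = stage.QuestionConfigHistory ?? [];
  return {
    ...stage,
    QuestionConfig: {
      Version: previous ? previous.Version + 1 : 1,
      // Position is derived from list order so the two cannot drift apart.
      Questions: entries.map((entry, index) => ({ ...entry, Position: index })),
      EffectiveFrom: effectiveFrom,
      UpdatedBy: updatedBy,
    },
    QuestionConfigHistory: previous ? [...history, previous] : history,
  };
}

const screeningV1 = {
  Name: "Screening",
  QuestionConfig: {
    Version: 1,
    Questions: [{ QuestionId: "q-a", VersionNumber: 1, Position: 0 }],
    EffectiveFrom: new Date("2026-01-01T00:00:00Z"),
    UpdatedBy: "user-1",
  },
  QuestionConfigHistory: [],
};
const screeningV2 = replaceQuestionConfig(
  screeningV1,
  [{ QuestionId: "q-a", VersionNumber: 2 }, { QuestionId: "q-b", VersionNumber: 1 }],
  "user-1",
  new Date("2026-03-01T00:00:00Z")
);
console.log(screeningV2.QuestionConfig.Version, screeningV2.QuestionConfigHistory.length);
// prints: 2 1
```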
Annotation — Version References
Add version tracking to the existing embedded Annotation base class to link each answer to the exact question version and stage configuration active when it was created.
New properties on Annotation:
public int QuestionVersionNumber { get; private set; } = 1;
public int StageConfigVersion { get; private set; } = 1;
MongoDB representation (embedded within pmStudy.extractionInfo.annotations[]):
{
  _id: CSUUID("..."),
  StudyId: CSUUID("..."),
  StageId: CSUUID("..."),
  AnnotatorId: CSUUID("..."),
  QuestionId: CSUUID("..."),
  Question: "What was the sample size?",
  AnswerType: "IntAnnotation",
  Answer: 42,
  Root: true,
  Reconciled: false,
  Notes: "From Table 1",
  // NEW fields:
  QuestionVersionNumber: 2,
  StageConfigVersion: 1
}
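With QuestionVersionNumber stored on each annotation, the question state an answer was given against can be recovered from the question document alone. The JavaScript sketch below shows one way to do that lookup; questionAsAnswered is a hypothetical helper, and treating a missing version number as 1 is an assumption based on the C# defaults above.

```javascript
// Hypothetical lookup: return the question state an annotation was answered against.
function questionAsAnswered(question, annotation) {
  const current = question.CurrentVersionNumber ?? 1;
  const wanted = annotation.QuestionVersionNumber ?? 1; // unversioned data counts as v1
  if (wanted === current) return question; // the live properties are the latest version
  const snapshot = (question._versionHistory ?? []).find((s) => s.VersionNumber === wanted);
  if (!snapshot) throw new Error(`Question has no snapshot for version ${wanted}`);
  return snapshot;
}

const sampleSize = {
  Question: "What was the sample size?",
  CurrentVersionNumber: 2,
  _versionHistory: [{ VersionNumber: 1, Question: "Sample size?" }],
};
const oldAnswer = { Answer: 40, QuestionVersionNumber: 1 };
const newAnswer = { Answer: 42, QuestionVersionNumber: 2 };

console.log(questionAsAnswered(sampleSize, oldAnswer).Question); // Sample size?
console.log(questionAsAnswered(sampleSize, newAnswer).Question); // What was the sample size?
```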
ExtractionInfo — Reconciliation Records
Add structured reconciliation tracking. This builds on the existing Reconciled flag on annotations and the Reconciliation flag on sessions.
New C# types:
public enum ReconciliationStatus
{
    Pending = 0,
    Resolved = 1,
    Deferred = 2
}

public class ReconciliationRecord
{
    public Guid Id { get; init; }
    public Guid StageId { get; init; }
    public int StageConfigVersion { get; init; }
    public Guid QuestionId { get; init; }
    public int QuestionVersionNumber { get; init; }
    public ReconciliationStatus Status { get; set; }
    public string? Resolution { get; set; }
    public Guid? ResolvedBy { get; set; }
    public DateTime? ResolvedAt { get; set; }
    public List<Guid> OriginalAnnotationIds { get; init; }
    public Guid? ConsensusAnnotationId { get; set; }
}
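New property on ExtractionInfo: a ReconciliationRecords list of ReconciliationRecord, which defaults to an empty list (new()) in the C# model.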
MongoDB representation (embedded within pmStudy.extractionInfo):
{
  Annotations: [ /* existing */ ],
  Sessions: [ /* existing */ ],
  OutcomeData: [ /* existing */ ],
  ReconciliationRecords: [
    {
      _id: CSUUID("..."),
      StageId: CSUUID("..."),
      StageConfigVersion: 1,
      QuestionId: CSUUID("..."),
      QuestionVersionNumber: 2,
      Status: "Pending",
      Resolution: null,
      ResolvedBy: null,
      ResolvedAt: null,
      OriginalAnnotationIds: [CSUUID("..."), CSUUID("...")],
      ConsensusAnnotationId: null
    }
  ]
}
Migration Plan
Migration Is Minimal
Because all changes are additive embedded fields with sensible defaults, migration is dramatically simpler than the original separate-collection proposal. No data moves between collections.
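Step 1: Backfill Question Version Numbers (Low risk)
For each existing AnnotationQuestion in each project, set CurrentVersionNumber = 1 and _versionHistory = []:
// Match only documents that have the array: positional array updates fail when the path is missing.
db.pmProject.updateMany(
  { _annotationQuestions: { $type: "array" } },
  {
    $set: {
      "_annotationQuestions.$[q].CurrentVersionNumber": 1,
      "_annotationQuestions.$[q]._versionHistory": []
    }
  },
  // Skip questions that already carry a version, so a re-run cannot reset real history.
  { arrayFilters: [{ "q.CurrentVersionNumber": { $exists: false } }] }
)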
Validation: All questions have CurrentVersionNumber = 1.
Alternative: Lazy migration — set defaults on read in the C# driver. The MongoDB driver can handle missing fields gracefully with default values.
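The lazy alternative amounts to filling in defaults at read time instead of rewriting stored documents. A minimal JavaScript sketch of that idea follows; withVersionDefaults is a hypothetical function, and in the application this role would be played by the C# property initialisers rather than explicit code.

```javascript
// Hypothetical read-side defaulting for a question loaded from pmProject.
function withVersionDefaults(question) {
  return {
    ...question,
    CurrentVersionNumber: question.CurrentVersionNumber ?? 1,
    _versionHistory: question._versionHistory ?? [],
  };
}

const storedBeforeMigration = { Question: "Sample size?" };
const storedAfterEdit = {
  Question: "What was the sample size?",
  CurrentVersionNumber: 2,
  _versionHistory: [{ VersionNumber: 1, Question: "Sample size?" }],
};

console.log(withVersionDefaults(storedBeforeMigration).CurrentVersionNumber); // 1
console.log(withVersionDefaults(storedAfterEdit).CurrentVersionNumber); // 2
```

One trade-off: stored documents stay unchanged until they are next saved, so $exists-based checks against the database keep reporting missing fields for untouched documents.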
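Step 2: Backfill Stage Question Configs (Low risk)
For each stage, generate QuestionConfig v1 from the existing AnnotationQuestions HashSet:
// mongosh sketch of the application-level script.
// Assumes stages are stored with _id, AnnotationQuestions and CreatedAt, as used elsewhere in this document.
db.pmProject.find({ Stages: { $type: "array" } }).forEach((project) => {
  project.Stages.forEach((stage) => {
    if (stage.QuestionConfig) return; // already backfilled, keep re-runs safe
    const config = {
      Version: 1,
      Questions: (stage.AnnotationQuestions || []).map((qId, index) => ({
        QuestionId: qId,
        VersionNumber: 1,
        Position: index
      })),
      EffectiveFrom: stage.CreatedAt || project.CreatedAt,
      UpdatedBy: null // System migration
    };
    db.pmProject.updateOne(
      { _id: project._id },
      { $set: { "Stages.$[s].QuestionConfig": config, "Stages.$[s].QuestionConfigHistory": [] } },
      { arrayFilters: [{ "s._id": stage._id }] } // target the stage by ID, not by array index
    );
  });
});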
Validation: Every stage has a QuestionConfig with version 1. Question count matches original AnnotationQuestions HashSet size.
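The validation rule can be expressed as a per-stage check. The JavaScript sketch below is illustrative only: validateStageBackfill is a hypothetical name, and IDs are plain strings so that set membership works (binary GUID values read in mongosh would need converting to strings first).

```javascript
// Hypothetical per-stage validator; returns a list of problems (empty means the stage passed).
function validateStageBackfill(stage) {
  const legacyIds = stage.AnnotationQuestions ?? [];
  const config = stage.QuestionConfig;
  if (!config) return ["missing QuestionConfig"];
  const problems = [];
  if (config.Version !== 1) problems.push(`expected Version 1, found ${config.Version}`);
  if (config.Questions.length !== legacyIds.length) {
    problems.push(`expected ${legacyIds.length} questions, found ${config.Questions.length}`);
  }
  const configured = new Set(config.Questions.map((q) => q.QuestionId));
  for (const id of legacyIds) {
    if (!configured.has(id)) problems.push(`question ${id} missing from QuestionConfig`);
  }
  return problems;
}

const goodStage = {
  AnnotationQuestions: ["q-a", "q-b"],
  QuestionConfig: {
    Version: 1,
    Questions: [
      { QuestionId: "q-a", VersionNumber: 1, Position: 0 },
      { QuestionId: "q-b", VersionNumber: 1, Position: 1 },
    ],
  },
};
const truncatedStage = {
  AnnotationQuestions: ["q-a", "q-b"],
  QuestionConfig: { Version: 1, Questions: [{ QuestionId: "q-a", VersionNumber: 1, Position: 0 }] },
};

console.log(validateStageBackfill(goodStage));
console.log(validateStageBackfill(truncatedStage));
```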
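Step 3: Backfill Annotation Version References (Low risk)
Set version references on all existing annotations:
db.pmStudy.updateMany(
  // $type guards against a null or non-array value, which positional array updates reject.
  { "extractionInfo.annotations": { $type: "array", $ne: [] } },
  {
    $set: {
      "extractionInfo.annotations.$[a].QuestionVersionNumber": 1,
      "extractionInfo.annotations.$[a].StageConfigVersion": 1
    }
  },
  // Only backfill annotations without a version reference, so the script is safe to re-run.
  { arrayFilters: [{ "a.QuestionVersionNumber": { $exists: false } }] }
)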
Validation: All annotations have QuestionVersionNumber = 1 and StageConfigVersion = 1.
Step 4: Initialise Empty Reconciliation Records (No-op)
No existing reconciliation records to migrate — this is a new feature. The ReconciliationRecords list will be empty for all existing studies and populated as reconciliation workflows run.
No migration script needed — the C# model defaults to new().
Rollback Plan
Each step has an independent rollback:
| Step | Rollback Action |
|---|---|
| Step 1 (Question Versions) | $unset the CurrentVersionNumber and _versionHistory fields |
| Step 2 (Stage Configs) | $unset the QuestionConfig and QuestionConfigHistory fields |
| Step 3 (Annotation Refs) | $unset the QuestionVersionNumber and StageConfigVersion fields |
Critical: All steps are additive — they add new fields without modifying existing data. Existing fields (AnnotationQuestions HashSet, Annotations array, etc.) are untouched. Rollback at any point means removing the new fields and reverting application code.
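The rollback actions in the table can be written down as update documents ahead of time. The JavaScript sketch below keeps them as data so they can be reviewed before anything runs. The array paths are the ones used by the migration scripts in this document; runRollback is a hypothetical helper, and the $type filters are an added guard rather than part of the original plan.

```javascript
// Rollback updates for Steps 1-3, kept as data so they can be reviewed before being run.
const rollbackOps = [
  {
    step: 1,
    collection: "pmProject",
    filter: { _annotationQuestions: { $type: "array" } },
    update: { $unset: {
      "_annotationQuestions.$[].CurrentVersionNumber": "",
      "_annotationQuestions.$[]._versionHistory": "",
    } },
  },
  {
    step: 2,
    collection: "pmProject",
    filter: { Stages: { $type: "array" } },
    update: { $unset: {
      "Stages.$[].QuestionConfig": "",
      "Stages.$[].QuestionConfigHistory": "",
    } },
  },
  {
    step: 3,
    collection: "pmStudy",
    filter: { "extractionInfo.annotations": { $type: "array" } },
    update: { $unset: {
      "extractionInfo.annotations.$[].QuestionVersionNumber": "",
      "extractionInfo.annotations.$[].StageConfigVersion": "",
    } },
  },
];

// Hypothetical helper: in mongosh, call runRollback(db, 3) to undo a single step.
function runRollback(db, step) {
  const op = rollbackOps.find((o) => o.step === step);
  if (!op) throw new Error(`No rollback defined for step ${step}`);
  return db.getCollection(op.collection).updateMany(op.filter, op.update);
}
```

Rollback is only lossless while the new fields hold nothing but backfilled defaults. Once application code has written real versions, unsetting the fields discards them.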
Data Integrity Checks
Pre-Migration Checks
// Verify no duplicate question IDs within a project
db.pmProject.aggregate([
  { $unwind: "$_annotationQuestions" },
  { $group: { _id: { projectId: "$_id", questionId: "$_annotationQuestions._id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
])
// Verify all annotation questionIds reference valid questions
db.pmStudy.aggregate([
  { $unwind: "$extractionInfo.annotations" },
  { $lookup: {
      from: "pmProject",
      let: { qId: "$extractionInfo.annotations.QuestionId", pId: "$ProjectId" },
      pipeline: [
        { $match: { $expr: { $eq: ["$_id", "$$pId"] } } },
        { $unwind: "$_annotationQuestions" },
        { $match: { $expr: { $eq: ["$_annotationQuestions._id", "$$qId"] } } }
      ],
      as: "matchedQuestion"
  }},
  { $match: { matchedQuestion: { $size: 0 } } },
  { $count: "orphanAnnotations" }
])
Post-Migration Checks
// 1. Verify all questions have version number
db.pmProject.aggregate([
  { $unwind: "$_annotationQuestions" },
  { $match: { "_annotationQuestions.CurrentVersionNumber": { $exists: false } } },
  { $count: "missingVersionNumber" }
])
// Expect no output: $count emits no document when nothing matches

// 2. Verify all stages have QuestionConfig
db.pmProject.aggregate([
  { $unwind: "$Stages" },
  { $match: { "Stages.QuestionConfig": { $exists: false } } },
  { $count: "missingConfig" }
])
// Expect no output

// 3. Verify all annotations have version references
db.pmStudy.aggregate([
  { $unwind: "$extractionInfo.annotations" },
  { $match: { "extractionInfo.annotations.QuestionVersionNumber": { $exists: false } } },
  { $count: "missingVersionRef" }
])
// Expect no output
Document Size Monitoring & Guardrails
Since we're deliberately keeping data embedded, we must actively monitor document growth to catch problems before hitting MongoDB's 16MB limit.
Monitoring Queries
Run these periodically (weekly or as part of a Quartz health-check job):
// Project document sizes (sorted largest first)
db.pmProject.aggregate([
  { $project: {
      name: "$Name",
      sizeBytes: { $bsonSize: "$$ROOT" },
      questionCount: { $size: { $ifNull: ["$_annotationQuestions", []] } },
      stageCount: { $size: { $ifNull: ["$Stages", []] } },
      totalVersionSnapshots: {
        $sum: {
          $map: {
            input: { $ifNull: ["$_annotationQuestions", []] },
            as: "q",
            in: { $size: { $ifNull: ["$$q._versionHistory", []] } }
          }
        }
      }
  }},
  { $sort: { sizeBytes: -1 } },
  { $limit: 20 }
])
// Study document sizes (sorted largest first)
db.pmStudy.aggregate([
  { $project: {
      projectId: "$ProjectId",
      sizeBytes: { $bsonSize: "$$ROOT" },
      annotationCount: { $size: { $ifNull: ["$extractionInfo.annotations", []] } },
      sessionCount: { $size: { $ifNull: ["$extractionInfo.sessions", []] } },
      reconciliationCount: { $size: { $ifNull: ["$extractionInfo.reconciliationRecords", []] } }
  }},
  { $sort: { sizeBytes: -1 } },
  { $limit: 20 }
])
Alert Thresholds
| Threshold | Action |
|---|---|
| Any document > 4 MB | Warning — investigate which project/study and why |
| Any document > 8 MB | Critical — plan extraction to separate collection for the growing entity |
| Any document > 12 MB | Emergency — migrate immediately before hitting 16MB hard limit |
| Annotation count per study > 5,000 | Warning — study accumulating unusually many annotations |
| Question version history > 50 versions | Warning — question being edited excessively |
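The thresholds translate directly into a small classifier that a health-check job could apply to each row returned by the study monitoring query. This JavaScript sketch is an illustration, not the Quartz job itself: sizeAlert and studyAlerts are hypothetical names, and treating "MB" as 1024 × 1024 bytes is an assumption chosen to match MongoDB's 16 MiB document limit.

```javascript
const MB = 1024 * 1024; // assumption: binary megabytes, as in MongoDB's 16 MiB document limit

// Map a document size to the alert level from the thresholds table.
function sizeAlert(sizeBytes) {
  if (sizeBytes > 12 * MB) return "emergency";
  if (sizeBytes > 8 * MB) return "critical";
  if (sizeBytes > 4 * MB) return "warning";
  return "ok";
}

// Row shape follows the study monitoring query: { sizeBytes, annotationCount }.
function studyAlerts(row) {
  const alerts = [];
  const level = sizeAlert(row.sizeBytes);
  if (level !== "ok") alerts.push(`${level}: document is ${(row.sizeBytes / MB).toFixed(1)} MB`);
  if (row.annotationCount > 5000) alerts.push(`warning: ${row.annotationCount} annotations`);
  return alerts;
}

console.log(studyAlerts({ sizeBytes: 9 * MB, annotationCount: 6200 }));
// [ 'critical: document is 9.0 MB', 'warning: 6200 annotations' ]
```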
Quartz Health-Check Job
A scheduled job in the Quartz service should run the monitoring queries weekly and log results. If any threshold is breached, emit a warning to Sentry with the affected document IDs.
Future Migration Escape Hatch
Repository Abstraction
Annotation and question reads/writes go through repository interfaces in SyRF.ProjectManagement.Mongo.Data/Repositories/. If extraction to separate collections becomes necessary:
- Create the new collection (e.g., pmAnnotation)
- Write a new repository implementation reading/writing to the separate collection
- Swap the DI registration
- Run a one-time migration script to copy embedded data to the new collection
- Domain model and service layer remain unchanged
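The steps above rely on callers depending only on the repository contract. The JavaScript sketch below illustrates the idea with in-memory stand-ins for the two storage layouts. Every name in it is hypothetical; the real interfaces are the C# repositories, whose members are not shown in this document.

```javascript
// Embedded layout: annotations live inside each study document.
class EmbeddedAnnotationRepository {
  constructor(studies) { this.studies = studies; } // Map: studyId -> study document
  getAnnotations(studyId) {
    return this.studies.get(studyId)?.extractionInfo?.annotations ?? [];
  }
}

// Extracted layout: annotations live in their own collection (an array stands in for pmAnnotation).
class SeparateCollectionAnnotationRepository {
  constructor(annotations) { this.annotations = annotations; }
  getAnnotations(studyId) {
    return this.annotations.filter((a) => a.StudyId === studyId);
  }
}

// One-time migration: copy embedded annotations into the new collection.
function copyEmbeddedAnnotations(studies) {
  return [...studies.values()].flatMap((s) => s.extractionInfo?.annotations ?? []);
}

// Service code depends only on getAnnotations, so it is unaffected by the swap.
function countAnswers(repository, studyId) {
  return repository.getAnnotations(studyId).length;
}

const studies = new Map([
  ["s-1", { extractionInfo: { annotations: [{ StudyId: "s-1", Answer: 42 }, { StudyId: "s-1", Answer: 7 }] } }],
  ["s-2", { extractionInfo: { annotations: [{ StudyId: "s-2", Answer: 1 }] } }],
]);
const before = new EmbeddedAnnotationRepository(studies);
const after = new SeparateCollectionAnnotationRepository(copyEmbeddedAnnotations(studies));
console.log(countAnswers(before, "s-1"), countAnswers(after, "s-1"));
```

The lookup by StudyId works because each annotation already carries its StudyId, as shown in the annotation representation earlier in this document.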
What Would Trigger Migration
| Signal | Likely Entity to Extract |
|---|---|
| Study documents exceeding 8MB | Annotations (out of pmStudy) |
| Projects with 500+ questions and frequent edits | Version history (out of pmProject) |
| Reconciliation records growing large per study | Reconciliation records (out of pmStudy) |
| Cross-entity query performance degrading | Whichever entity is being queried across aggregates |
Version Reference Stability
The QuestionVersionNumber and StageConfigVersion fields on annotations act as stable references even in embedded form. If questions or configs are later extracted to separate collections, these version numbers remain valid foreign keys — no annotation data needs to change.
Environment Strategy
| Environment | Approach |
|---|---|
| Development | Run migration on local MongoDB; iterate freely |
| Staging | Run migration on staging database (currently shares the production syrftest database) |
| Production | Run migration on syrftest with brief maintenance window |
WARNING: Staging and production currently share the same MongoDB database (syrftest). Migration on staging IS migration on production. See MongoDB Testing Strategy for planned isolation.
Migration Execution Plan
- Week 1: Run on development with test data. Validate all checks pass.
- Week 2: Run read-only audit on syrftest (production). Generate baseline document size report.
- Week 3: Run Steps 1-3 on syrftest (additive fields only, zero risk to existing data). Validate.
- Week 4: Deploy application code that reads the new fields. Monitor for 48 hours.
- Ongoing: Weekly monitoring job runs automatically via Quartz.
GUID Representation Reminder
All IDs in MongoDB are stored as CSUUID (C# Legacy GUID format) — BinData subtype 3. Migration scripts must use CSUUID() format in queries, not standard UUID().
See CLAUDE.md MongoDB section for details.
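The reason UUID() does not match is byte order as well as subtype. The shell's UUID() produces BinData subtype 4 with the bytes in the order they are written, whereas the C# legacy format stores subtype 3 with the first three groups byte-swapped, the same layout as .NET's Guid.ToByteArray(). The JavaScript sketch below shows only the reordering; csharpLegacyHex is a hypothetical helper for illustration, and real scripts should keep using CSUUID().

```javascript
// Reorder a GUID's hex digits into the byte order used by the C# legacy (subtype 3) format.
function csharpLegacyHex(guid) {
  const hex = guid.replace(/[{}-]/g, "").toLowerCase();
  if (!/^[0-9a-f]{32}$/.test(hex)) throw new Error(`Not a GUID: ${guid}`);
  const swapBytes = (s) => s.match(/../g).reverse().join("");
  return (
    swapBytes(hex.slice(0, 8)) +   // first group (4 bytes) is reversed
    swapBytes(hex.slice(8, 12)) +  // second group (2 bytes) is reversed
    swapBytes(hex.slice(12, 16)) + // third group (2 bytes) is reversed
    hex.slice(16)                  // remaining 8 bytes keep their order
  );
}

console.log(csharpLegacyHex("00112233-4455-6677-8899-aabbccddeeff"));
// 33221100554477668899aabbccddeeff
```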
Acceptance Criteria
- Pre-migration audit report generated and reviewed
- Baseline document size report generated
- Question version fields backfilled (CurrentVersionNumber, _versionHistory)
- Stage question configs generated from existing AnnotationQuestions HashSets
- Annotation version references backfilled (QuestionVersionNumber, StageConfigVersion)
- All post-migration validation checks pass
- Backward compatibility confirmed: existing API endpoints work with enriched model
- Monitoring queries documented and tested
- Quartz health-check job implemented and running
- Rollback tested on staging