Shared: Data Model & Migration Plan¶

Summary¶

This document describes the schema enrichments to existing MongoDB aggregates required to support versioned questions, stage-level question configuration, and reconciliation records. It also details the migration strategy for existing data.

Parent: Annotation Management & Reconciliation

⚠️ Precedence note: This document predates the Design Decisions (D1–D50) and the Annotation Versioning Design (D37–D50). Where this document contradicts those authoritative references, the design decisions take precedence. In particular:

The Reconciliation Records schema (embedded ReconciliationRecord in Study.ExtractionInfo) has been superseded by the standalone Reconciliation Session model (D5). The migration approach and tooling remain valid; the target schema needs updating.

The embedded-first, extract-if-needed design principle, migration strategy (additive fields, rollback-by-unset), monitoring queries, and environment strategy remain current and applicable to Phase 7.

This document was merged from the closed PR #2325 branch to preserve implementation detail not carried forward during spec restructuring. It should be reconciled with the design decisions before Phase 7 execution.

Design Principles¶

Embedded-First, Extract-If-Needed¶

After design review, we chose to keep all data embedded in existing aggregates rather than creating new collections. The rationale:

Concern	Embedded	Separate Collections
Atomic consistency	Single-document writes, no multi-doc transactions needed	Requires multi-document transactions or eventual consistency
Read performance	Loading a project/study gives everything needed	Requires joins (`$lookup`) or multiple queries
Complexity	No new collections, repositories, or indexes to manage	5 new collections, new repository implementations, new index strategies
Data volume	Questions: small/bounded. Annotations: within 16MB per study	Only annotations could theoretically grow large, but current data is well within limits
Audit trail	Version arrays are more readable than cross-collection references	Opaque snapshot hashes require resolution to compare

What we're NOT doing (and why):

Original Proposal	Decision	Rationale
`pmQuestionVersion` collection	Embedded `_versionHistory` on `AnnotationQuestion`	Small, bounded, project-scoped, atomic with question edits
`pmQuestionSet` collection	Embedded `StageQuestionConfig` on `Stage`	Thin reference list, tightly coupled to stage lifecycle
`pmStageAssignment` collection	Embedded `StageQuestionConfig` with history on `Stage`	1-5 configs per stage, no cross-project queries needed
`pmAnnotation` collection	Keep in `Study.ExtractionInfo.Annotations`	Atomic consistency with study, working today, no evidence of size pressure
`pmReconciliation` collection	Embedded `ReconciliationRecords` in `ExtractionInfo`	New feature, bounded records, atomic with annotations and sessions

Schema Changes Overview¶

No New Collections¶

All changes are embedded enrichments to existing pmProject and pmStudy documents.

Modified Aggregates¶

Aggregate	Collection	Change
`Project.AnnotationQuestion`	`pmProject`	Gains `CurrentVersionNumber`, `_versionHistory`
`Project.Stage`	`pmProject`	Gains `QuestionConfig`, `QuestionConfigHistory` (replaces `AnnotationQuestions` HashSet)
`Study.Annotation`	`pmStudy`	Gains `QuestionVersionNumber`, `StageConfigVersion`
`Study.ExtractionInfo`	`pmStudy`	Gains `ReconciliationRecords` list

Detailed Schema Designs¶

AnnotationQuestion — Version History¶

Each AnnotationQuestion gains a version history. The question's current properties ARE the latest version; previous states are stored as immutable snapshots.

This follows the existing v0/v1 schema pattern already used by AnnotationQuestion (backing fields like _v0Options/_v1Options in AnnotationQuestion.cs).

New C# types:

// Immutable snapshot of a question at a point in time
public class QuestionVersionSnapshot
{
    public int VersionNumber { get; init; }
    public string Question { get; init; }
    public string QuestionType { get; init; }
    public string ControlType { get; init; }
    public string Category { get; init; }
    public bool Optional { get; init; }
    public bool Multiple { get; init; }
    public bool AnswerArray { get; init; }
    public ImmutableList<QuestionOption>? Options { get; init; }
    public Target? Target { get; init; }
    public List<Guid> SubquestionIds { get; init; }
    public string? Description { get; init; }
    public DateTime CreatedAt { get; init; }
    public Guid CreatedBy { get; init; }
}

New properties on AnnotationQuestion:

public int CurrentVersionNumber { get; private set; } = 1;
private List<QuestionVersionSnapshot> _versionHistory = new();
public IReadOnlyList<QuestionVersionSnapshot> VersionHistory => _versionHistory.AsReadOnly();

MongoDB representation (embedded within pmProject._annotationQuestions[]):

{
  _id: CSUUID("..."),
  Question: "What was the sample size?",
  QuestionType: "integer",
  ControlType: "textbox",
  // ... all existing fields unchanged ...
  CurrentVersionNumber: 2,
  _versionHistory: [
    {
      VersionNumber: 1,
      Question: "Sample size?",           // Original text before edit
      QuestionType: "integer",
      ControlType: "textbox",
      Category: "Study",
      Optional: false,
      Multiple: false,
      AnswerArray: false,
      Options: null,
      Target: null,
      SubquestionIds: [],
      Description: "Number of subjects",
      CreatedAt: ISODate("2026-01-15T10:00:00Z"),
      CreatedBy: CSUUID("...")
    }
  ]
}

Behaviour: When AnnotationQuestion.Update() is called, the current state is snapshotted into _versionHistory before applying the edit, then CurrentVersionNumber is incremented.
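The snapshot-then-edit behaviour can be sketched on plain objects. This is a minimal illustration, not the C# implementation: `updateQuestion`, `SNAPSHOT_FIELDS`, and the `editorId` parameter are hypothetical names. The source does not say whether `CreatedAt`/`CreatedBy` describe the snapshot or the original version, so the sketch assumes they record when and by whom the snapshot was taken.

```javascript
// Fields copied into each snapshot; names follow QuestionVersionSnapshot above.
const SNAPSHOT_FIELDS = [
  "Question", "QuestionType", "ControlType", "Category", "Optional", "Multiple",
  "AnswerArray", "Options", "Target", "SubquestionIds", "Description",
];

// Hypothetical helper: returns a new question with the pre-edit state archived.
// The input object is not mutated.
function updateQuestion(question, changes, editorId, now = new Date()) {
  const snapshot = { VersionNumber: question.CurrentVersionNumber ?? 1 };
  for (const field of SNAPSHOT_FIELDS) {
    const value = question[field] ?? null;
    // Copy arrays so later edits to the live question cannot alter the snapshot.
    snapshot[field] = Array.isArray(value) ? [...value] : value;
  }
  // Assumption: CreatedAt/CreatedBy record when and by whom the snapshot was taken.
  snapshot.CreatedAt = now;
  snapshot.CreatedBy = editorId;

  return {
    ...question,
    ...changes,
    CurrentVersionNumber: snapshot.VersionNumber + 1,
    _versionHistory: [...(question._versionHistory ?? []), snapshot],
  };
}
```

Applied to a version 1 question, the helper yields `CurrentVersionNumber: 2` and a single history entry holding the original text, which matches the shape of the MongoDB example above.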

Stage — Question Configuration¶

Replace the flat HashSet<Guid> AnnotationQuestions with a versioned configuration that tracks which question version is assigned at each position.

New C# types:

public class StageQuestionEntry
{
    public Guid QuestionId { get; init; }
    public int VersionNumber { get; init; }
    public int Position { get; init; }
}

public class StageQuestionConfig
{
    public int Version { get; init; }              // Config version (increments on change)
    public List<StageQuestionEntry> Questions { get; init; }
    public DateTime EffectiveFrom { get; init; }
    public Guid? UpdatedBy { get; init; }
}

New properties on Stage:

public StageQuestionConfig QuestionConfig { get; private set; }
public List<StageQuestionConfig> QuestionConfigHistory { get; private set; } = new();

Backward compatibility: The existing AnnotationQuestions HashSet becomes a computed property:

public HashSet<Guid> AnnotationQuestions =>
    new(QuestionConfig?.Questions.Select(q => q.QuestionId) ?? Enumerable.Empty<Guid>());

MongoDB representation (embedded within pmProject.Stages[]):

{
  _id: CSUUID("..."),
  Name: "Screening",
  Active: true,
  Extraction: true,
  // ... existing fields ...
  QuestionConfig: {
    Version: 2,
    Questions: [
      { QuestionId: CSUUID("..."), VersionNumber: 2, Position: 0 },
      { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 1 },
      { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 2 }
    ],
    EffectiveFrom: ISODate("2026-03-01T00:00:00Z"),
    UpdatedBy: CSUUID("...")
  },
  QuestionConfigHistory: [
    {
      Version: 1,
      Questions: [
        { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 0 },
        { QuestionId: CSUUID("..."), VersionNumber: 1, Position: 1 }
      ],
      EffectiveFrom: ISODate("2026-01-01T00:00:00Z"),
      UpdatedBy: CSUUID("...")
    }
  ]
}
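The source does not spell out what happens when a stage's question list changes. Assuming it mirrors the question snapshot pattern (the current config moves into `QuestionConfigHistory` and a new config with an incremented `Version` takes its place), the transition might look like the sketch below. `updateStageQuestions` and `annotationQuestionIds` are hypothetical helper names, and positions are assumed to be re-derived from list order.

```javascript
// Hypothetical helper: replaces a stage's question list and archives the previous config.
// The input stage is not mutated.
function updateStageQuestions(stage, entries, updatedBy, now = new Date()) {
  const previous = stage.QuestionConfig ?? null;
  const history = [...(stage.QuestionConfigHistory ?? [])];
  if (previous) history.push(previous);

  const next = {
    Version: previous ? previous.Version + 1 : 1,
    // Assumption: Position is the zero-based index in the supplied list.
    Questions: entries.map((entry, index) => ({
      QuestionId: entry.QuestionId,
      VersionNumber: entry.VersionNumber,
      Position: index,
    })),
    EffectiveFrom: now,
    UpdatedBy: updatedBy,
  };
  return { ...stage, QuestionConfig: next, QuestionConfigHistory: history };
}

// Mirrors the computed AnnotationQuestions property: distinct question IDs in the current config.
function annotationQuestionIds(stage) {
  return new Set((stage.QuestionConfig?.Questions ?? []).map((q) => q.QuestionId));
}
```

Keeping the previous config intact in the history array is what lets an annotation's `StageConfigVersion` be resolved later.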

Annotation — Version References¶

Add version tracking to the existing embedded Annotation base class to link each answer to the exact question version and stage configuration active when it was created.

New properties on Annotation:

public int QuestionVersionNumber { get; private set; } = 1;
public int StageConfigVersion { get; private set; } = 1;

MongoDB representation (embedded within pmStudy.extractionInfo.annotations[]):

{
  _id: CSUUID("..."),
  StudyId: CSUUID("..."),
  StageId: CSUUID("..."),
  AnnotatorId: CSUUID("..."),
  QuestionId: CSUUID("..."),
  Question: "What was the sample size?",
  AnswerType: "IntAnnotation",
  Answer: 42,
  Root: true,
  Reconciled: false,
  Notes: "From Table 1",
  // NEW fields:
  QuestionVersionNumber: 2,
  StageConfigVersion: 1
}
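To illustrate how these references are meant to be used, the sketch below resolves the question text an annotator actually saw from an annotation's `QuestionVersionNumber`. It operates on plain objects with string IDs, which is an assumption made for readability; real documents hold binary GUIDs that need a byte-wise comparison. `resolveQuestionVersion` and `questionTextForAnnotation` are hypothetical names.

```javascript
// Returns the question fields as they were at versionNumber, or null if that version is unknown.
function resolveQuestionVersion(question, versionNumber) {
  const current = question.CurrentVersionNumber ?? 1;
  if (versionNumber === current) {
    // The live properties ARE the latest version, so strip only the bookkeeping fields.
    const { _versionHistory, CurrentVersionNumber, ...fields } = question;
    return { ...fields, VersionNumber: current };
  }
  const history = question._versionHistory ?? [];
  return history.find((s) => s.VersionNumber === versionNumber) ?? null;
}

// Looks up the question text that was in force when the annotation was created.
function questionTextForAnnotation(project, annotation) {
  const question = (project._annotationQuestions ?? []).find(
    (q) => q._id === annotation.QuestionId
  );
  if (!question) return null;
  // Annotations written before the migration are treated as version 1.
  const version = resolveQuestionVersion(question, annotation.QuestionVersionNumber ?? 1);
  return version ? version.Question : null;
}
```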

ExtractionInfo — Reconciliation Records¶

Add structured reconciliation tracking. Builds on the existing Reconciled flag on annotations and Reconciliation flag on sessions.

New C# types:

public enum ReconciliationStatus
{
    Pending = 0,
    Resolved = 1,
    Deferred = 2
}

public class ReconciliationRecord
{
    public Guid Id { get; init; }
    public Guid StageId { get; init; }
    public int StageConfigVersion { get; init; }
    public Guid QuestionId { get; init; }
    public int QuestionVersionNumber { get; init; }
    public ReconciliationStatus Status { get; set; }
    public string? Resolution { get; set; }
    public Guid? ResolvedBy { get; set; }
    public DateTime? ResolvedAt { get; set; }
    public List<Guid> OriginalAnnotationIds { get; init; }
    public Guid? ConsensusAnnotationId { get; set; }
}

New property on ExtractionInfo:

public List<ReconciliationRecord> ReconciliationRecords { get; private set; } = new();

MongoDB representation (embedded within pmStudy.extractionInfo):

{
  Annotations: [ /* existing */ ],
  Sessions: [ /* existing */ ],
  OutcomeData: [ /* existing */ ],
  ReconciliationRecords: [
    {
      _id: CSUUID("..."),
      StageId: CSUUID("..."),
      StageConfigVersion: 1,
      QuestionId: CSUUID("..."),
      QuestionVersionNumber: 2,
      Status: "Pending",
      Resolution: null,
      ResolvedBy: null,
      ResolvedAt: null,
      OriginalAnnotationIds: [CSUUID("..."), CSUUID("...")],
      ConsensusAnnotationId: null
    }
  ]
}

Migration Plan¶

Migration Is Minimal¶

Because all changes are additive embedded fields with sensible defaults, migration is dramatically simpler than the original separate-collection proposal. No data moves between collections.

Step 1: Backfill Question Version Numbers (Low risk)¶

For each existing AnnotationQuestion in each project, set CurrentVersionNumber = 1 and _versionHistory = []:

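// Filter out projects without the array: $[] raises an error when the array field is missing
db.pmProject.updateMany(
  { _annotationQuestions: { $exists: true, $ne: [] } },
  {
    $set: {
      "_annotationQuestions.$[].CurrentVersionNumber": 1,
      "_annotationQuestions.$[]._versionHistory": []
    }
  }
)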

Validation: All questions have CurrentVersionNumber = 1.

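Alternative: lazy migration. Skip the backfill and let the C# model supply defaults on read. When a stored document lacks the new fields, the MongoDB driver leaves the properties at their initialised values (`CurrentVersionNumber = 1`, an empty `_versionHistory`).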
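For scripts that read raw documents rather than going through the C# model, the same default-on-read idea can be sketched as a small normaliser. `withVersionDefaults` is a hypothetical name; the default values are the ones stated in Step 1.

```javascript
// Hypothetical read-side normaliser: fills in the Step 1 defaults without writing to the database.
function withVersionDefaults(question) {
  return {
    ...question,
    CurrentVersionNumber: question.CurrentVersionNumber ?? 1,
    _versionHistory: question._versionHistory ?? [],
  };
}
```

One consequence to weigh: with lazy migration the stored documents keep lacking the fields until they are next saved, so any check based on `$exists: false` would keep reporting them as unmigrated.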

Step 2: Backfill Stage Question Configs (Low risk)¶

For each stage, generate QuestionConfig v1 from the existing AnnotationQuestions HashSet:

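// mongosh script: derive QuestionConfig v1 from each stage's existing AnnotationQuestions
db.pmProject.find({ "Stages.0": { $exists: true } }).forEach((project) => {
  const set = {};
  project.Stages.forEach((stage, i) => {
    if (stage.QuestionConfig) return; // already migrated, so the script can be re-run safely
    set[`Stages.${i}.QuestionConfig`] = {
      Version: 1,
      Questions: (stage.AnnotationQuestions || []).map((qId, index) => ({
        QuestionId: qId,
        VersionNumber: 1,
        Position: index
      })),
      EffectiveFrom: stage.CreatedAt || project.CreatedAt,
      UpdatedBy: null // System migration
    };
    set[`Stages.${i}.QuestionConfigHistory`] = [];
  });
  // Only the new fields are written; existing stage data is left untouched
  if (Object.keys(set).length > 0) {
    db.pmProject.updateOne({ _id: project._id }, { $set: set });
  }
});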

Validation: Every stage has a QuestionConfig with version 1. Question count matches original AnnotationQuestions HashSet size.
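The validation rule above can be expressed as a per-project check. This is a sketch under the assumption that the legacy question IDs are stored as an array named `AnnotationQuestions` on each stage; `findStageConfigProblems` is a hypothetical name.

```javascript
// Returns one entry per stage that violates the Step 2 validation rule.
function findStageConfigProblems(project) {
  const problems = [];
  for (const stage of project.Stages ?? []) {
    const config = stage.QuestionConfig;
    const legacyCount = (stage.AnnotationQuestions ?? []).length;
    if (!config) {
      problems.push({ stageId: stage._id, problem: "missing QuestionConfig" });
      continue;
    }
    if (config.Version !== 1) {
      problems.push({ stageId: stage._id, problem: `unexpected config version ${config.Version}` });
    }
    const migratedCount = (config.Questions ?? []).length;
    if (migratedCount !== legacyCount) {
      problems.push({
        stageId: stage._id,
        problem: `question count ${migratedCount} does not match legacy count ${legacyCount}`,
      });
    }
  }
  return problems;
}
```

In mongosh the function could be applied to each document returned by `db.pmProject.find()`, printing any non-empty result.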

Step 3: Backfill Annotation Version References (Low risk)¶

Set version references on all existing annotations:

db.pmStudy.updateMany(
  { "extractionInfo.annotations": { $exists: true, $ne: [] } },
  {
    $set: {
      "extractionInfo.annotations.$[].QuestionVersionNumber": 1,
      "extractionInfo.annotations.$[].StageConfigVersion": 1
    }
  }
)

Validation: All annotations have QuestionVersionNumber = 1 and StageConfigVersion = 1.

Step 4: Initialise Empty Reconciliation Records (No-op)¶

No existing reconciliation records to migrate — this is a new feature. The ReconciliationRecords list will be empty for all existing studies and populated as reconciliation workflows run.

No migration script needed — the C# model defaults to new().

Rollback Plan¶

Each step has an independent rollback:

Step	Rollback Action
Step 1 (Question Versions)	`$unset` the `CurrentVersionNumber` and `_versionHistory` fields
Step 2 (Stage Configs)	`$unset` the `QuestionConfig` and `QuestionConfigHistory` fields
Step 3 (Annotation Refs)	`$unset` the `QuestionVersionNumber` and `StageConfigVersion` fields

Critical: All steps are additive — they add new fields without modifying existing data. Existing fields (AnnotationQuestions HashSet, Annotations array, etc.) are untouched. Rollback at any point means removing the new fields and reverting application code.
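The rollback table can be written down as data, so that the `$unset` documents are reviewed once and reused. This is a sketch: `ROLLBACK_STEPS` and `rollbackPlan` are hypothetical names, the filters are added so that `$[]` is never applied to a document without the array, and undoing the most recent step first is a choice made here rather than a requirement of the source.

```javascript
// One entry per migration step: which collection to touch and which fields to remove.
const ROLLBACK_STEPS = [
  {
    step: 1,
    collection: "pmProject",
    filter: { _annotationQuestions: { $exists: true, $ne: [] } },
    update: { $unset: {
      "_annotationQuestions.$[].CurrentVersionNumber": "",
      "_annotationQuestions.$[]._versionHistory": "",
    } },
  },
  {
    step: 2,
    collection: "pmProject",
    filter: { Stages: { $exists: true, $ne: [] } },
    update: { $unset: {
      "Stages.$[].QuestionConfig": "",
      "Stages.$[].QuestionConfigHistory": "",
    } },
  },
  {
    step: 3,
    collection: "pmStudy",
    filter: { "extractionInfo.annotations": { $exists: true, $ne: [] } },
    update: { $unset: {
      "extractionInfo.annotations.$[].QuestionVersionNumber": "",
      "extractionInfo.annotations.$[].StageConfigVersion": "",
    } },
  },
];

// Returns the rollback actions for every completed step, most recent first.
function rollbackPlan(lastCompletedStep) {
  return ROLLBACK_STEPS
    .filter((s) => s.step <= lastCompletedStep)
    .sort((a, b) => b.step - a.step);
}

// In mongosh, each action would be applied as:
//   db.getCollection(action.collection).updateMany(action.filter, action.update)
```

Once application code has started writing real version history, unsetting these fields discards that history, so rollback is only lossless before the new code is deployed.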

Data Integrity Checks¶

Pre-Migration Checks¶

// Verify no duplicate question IDs within a project
db.pmProject.aggregate([
  { $unwind: "$_annotationQuestions" },
  { $group: { _id: { projectId: "$_id", questionId: "$_annotationQuestions._id" }, count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
])

// Verify all annotation questionIds reference valid questions
db.pmStudy.aggregate([
  { $unwind: "$extractionInfo.annotations" },
  { $lookup: {
      from: "pmProject",
      let: { qId: "$extractionInfo.annotations.QuestionId", pId: "$ProjectId" },
      pipeline: [
        { $match: { $expr: { $eq: ["$_id", "$$pId"] } } },
        { $unwind: "$_annotationQuestions" },
        { $match: { $expr: { $eq: ["$_annotationQuestions._id", "$$qId"] } } }
      ],
      as: "matchedQuestion"
  }},
  { $match: { matchedQuestion: { $size: 0 } } },
  { $count: "orphanAnnotations" }
])

Post-Migration Checks¶

// 1. Verify all questions have version number
db.pmProject.aggregate([
  { $unwind: "$_annotationQuestions" },
  { $match: { "_annotationQuestions.CurrentVersionNumber": { $exists: false } } },
  { $count: "missingVersionNumber" }
])
// Expect no output: $count emits no document when nothing matches

// 2. Verify all stages have QuestionConfig
db.pmProject.aggregate([
  { $unwind: "$Stages" },
  { $match: { "Stages.QuestionConfig": { $exists: false } } },
  { $count: "missingConfig" }
])
// Expect no output: $count emits no document when nothing matches

// 3. Verify all annotations have version references
db.pmStudy.aggregate([
  { $unwind: "$extractionInfo.annotations" },
  { $match: { "extractionInfo.annotations.QuestionVersionNumber": { $exists: false } } },
  { $count: "missingVersionRef" }
])
// Expect no output: $count emits no document when nothing matches

Document Size Monitoring & Guardrails¶

Since we're deliberately keeping data embedded, we must actively monitor document growth to catch problems before hitting MongoDB's 16MB limit.

Monitoring Queries¶

Run these periodically (weekly or as part of a Quartz health-check job):

// Project document sizes (sorted largest first); $bsonSize requires MongoDB 4.4 or later
db.pmProject.aggregate([
  { $project: {
      name: "$Name",
      sizeBytes: { $bsonSize: "$$ROOT" },
      questionCount: { $size: { $ifNull: ["$_annotationQuestions", []] } },
      stageCount: { $size: { $ifNull: ["$Stages", []] } },
      totalVersionSnapshots: {
        $sum: {
          $map: {
            input: { $ifNull: ["$_annotationQuestions", []] },
            as: "q",
            in: { $size: { $ifNull: ["$$q._versionHistory", []] } }
          }
        }
      }
  }},
  { $sort: { sizeBytes: -1 } },
  { $limit: 20 }
])

// Study document sizes (sorted largest first)
db.pmStudy.aggregate([
  { $project: {
      projectId: "$ProjectId",
      sizeBytes: { $bsonSize: "$$ROOT" },
      annotationCount: { $size: { $ifNull: ["$extractionInfo.annotations", []] } },
      sessionCount: { $size: { $ifNull: ["$extractionInfo.sessions", []] } },
      reconciliationCount: { $size: { $ifNull: ["$extractionInfo.reconciliationRecords", []] } }
  }},
  { $sort: { sizeBytes: -1 } },
  { $limit: 20 }
])

Alert Thresholds¶

Threshold	Action
Any document > 4 MB	Warning — investigate which project/study and why
Any document > 8 MB	Critical — plan extraction to separate collection for the growing entity
Any document > 12 MB	Emergency — migrate immediately before hitting 16MB hard limit
Annotation count per study > 5,000	Warning — study accumulating unusually many annotations
Question version history > 50 versions	Warning — question being edited excessively
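A sketch of how a health-check job might apply these thresholds to one row of the study monitoring query output. The function names are hypothetical, and 1 MB is taken as 1024 × 1024 bytes, which is an assumption (MongoDB's hard limit is 16 × 1024 × 1024 bytes). The per-question version history threshold is not covered, because the monitoring query reports only a project-wide total.

```javascript
const MB = 1024 * 1024; // assumption: binary megabytes, matching MongoDB's 16 MiB limit

// Maps a document size to the alert level from the threshold table.
function sizeAlertLevel(sizeBytes) {
  if (sizeBytes > 12 * MB) return "emergency";
  if (sizeBytes > 8 * MB) return "critical";
  if (sizeBytes > 4 * MB) return "warning";
  return "ok";
}

// Builds alert messages for one study row ({ sizeBytes, annotationCount }).
function studyAlerts(row) {
  const alerts = [];
  const level = sizeAlertLevel(row.sizeBytes);
  if (level !== "ok") {
    alerts.push(`${level}: document size is ${row.sizeBytes} bytes`);
  }
  if (row.annotationCount > 5000) {
    alerts.push(`warning: ${row.annotationCount} annotations in one study`);
  }
  return alerts;
}
```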

Quartz Health-Check Job¶

A scheduled job in the Quartz service should run the monitoring queries weekly and log results. If any threshold is breached, emit a warning to Sentry with the affected document IDs.

Future Migration Escape Hatch¶

Repository Abstraction¶

Annotation and question reads/writes go through repository interfaces in SyRF.ProjectManagement.Mongo.Data/Repositories/. If extraction to separate collections becomes necessary:

Create the new collection (e.g., pmAnnotation)
Write a new repository implementation reading/writing to the separate collection
Swap the DI registration
Run a one-time migration script to copy embedded data to the new collection
Domain model and service layer remain unchanged
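The swap described above relies on callers depending only on the repository's interface. The sketch below uses in-memory stand-ins rather than real MongoDB access, and all class and function names are hypothetical. It shows two implementations that answer the same call, plus the one-time copy from embedded to standalone form.

```javascript
// Reads annotations from where they live today: embedded in each study document.
class EmbeddedAnnotationRepository {
  constructor(studiesById) { this.studiesById = studiesById; } // Map of studyId -> study document
  getAnnotations(studyId) {
    return this.studiesById.get(studyId)?.extractionInfo?.annotations ?? [];
  }
}

// Reads annotations from a standalone collection (e.g., pmAnnotation), keyed by StudyId.
class SeparateCollectionAnnotationRepository {
  constructor(annotations) { this.annotations = annotations; } // flat array of annotation documents
  getAnnotations(studyId) {
    return this.annotations.filter((a) => a.StudyId === studyId);
  }
}

// One-time migration: flattens embedded annotations into standalone documents.
function copyEmbeddedAnnotations(studiesById) {
  const out = [];
  for (const study of studiesById.values()) {
    out.push(...(study.extractionInfo?.annotations ?? []));
  }
  return out;
}

// A caller written against the interface works with either implementation.
function countAnnotations(repository, studyId) {
  return repository.getAnnotations(studyId).length;
}
```

The copy works without restructuring because each embedded annotation already carries its own `StudyId`.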

What Would Trigger Migration¶

Signal	Likely Entity to Extract
Study documents exceeding 8MB	Annotations (out of `pmStudy`)
Projects with 500+ questions and frequent edits	Version history (out of `pmProject`)
Reconciliation records growing large per study	Reconciliation records (out of `pmStudy`)
Cross-entity query performance degrading	Whichever entity is being queried across aggregates

Version Reference Stability¶

The QuestionVersionNumber and StageConfigVersion fields on annotations act as stable references even in embedded form. If questions or configs are later extracted to separate collections, these version numbers remain valid foreign keys — no annotation data needs to change.

Environment Strategy¶

Environment	Approach
Development	Run migration on local MongoDB; iterate freely
Staging	Run migration on staging database (currently shares prod `syrftest`!)
Production	Run migration on `syrftest` with brief maintenance window

WARNING: Staging and production currently share the same MongoDB database (syrftest). Migration on staging IS migration on production. See MongoDB Testing Strategy for planned isolation.

Migration Execution Plan¶

Week 1: Run on development with test data. Validate all checks pass.
Week 2: Run read-only audit on syrftest (production). Generate baseline document size report.
Week 3: Run Steps 1-3 on syrftest (additive fields only; existing fields are not modified). Validate.
Week 4: Deploy application code that reads the new fields. Monitor for 48 hours.
Ongoing: Weekly monitoring job runs automatically via Quartz.

GUID Representation Reminder¶

All IDs in MongoDB are stored as CSUUID (C# Legacy GUID format) — BinData subtype 3. Migration scripts must use CSUUID() format in queries, not standard UUID().

See CLAUDE.md MongoDB section for details.
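The reason `UUID()` and `CSUUID()` are not interchangeable is byte order: the C# legacy representation stores the bytes as .NET's `Guid.ToByteArray()` produces them, with the first three groups of the GUID little-endian. The sketch below reproduces that byte order as a hex string; `guidToCSharpLegacyHex` is a hypothetical helper for illustration, not part of any driver.

```javascript
// Converts a GUID string to the hex of its C# legacy (BinData subtype 3) byte layout.
function guidToCSharpLegacyHex(guid) {
  const hex = guid.replace(/-/g, "").toLowerCase();
  if (!/^[0-9a-f]{32}$/.test(hex)) {
    throw new Error(`Not a GUID: ${guid}`);
  }
  // Reverses the byte order of a hex string, two characters (one byte) at a time.
  const reverseBytes = (s) => s.match(/../g).reverse().join("");
  return (
    reverseBytes(hex.slice(0, 8)) +   // first group: 4 bytes, little-endian
    reverseBytes(hex.slice(8, 12)) +  // second group: 2 bytes, little-endian
    reverseBytes(hex.slice(12, 16)) + // third group: 2 bytes, little-endian
    hex.slice(16)                     // last 8 bytes are stored as written
  );
}

console.log(guidToCSharpLegacyHex("00112233-4455-6677-8899-aabbccddeeff"));
// prints "33221100554477668899aabbccddeeff"
```

A query built with the standard `UUID()` helper compares against the unswapped bytes under subtype 4, so it matches nothing stored in the legacy format.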
