Annotation Import Feature Design

Overview

Allow project administrators to import annotations from external tools into an existing SyRF project stage. The primary use case is migration from another systematic review tool (e.g., Rayyan, Covidence, custom spreadsheets) where researchers have already coded data and want to bring it into SyRF.

Roadmap placement: Phase 6.5, inserted between Phase 6 (Question Management UI) and Phase 7 (Release 1 Data Migration). Targets the AnnotationQuestionV2 model from the QM v2 work landing in PRs #2572–#2575.

Implementation alignment: This spec predates the final QM v2 domain model. Where this document refers to "AQ", "AQVersion", "QSV", and their reference types, the authoritative current types are:

  • AnnotationQuestionV2 (class) with its embedded List<AQVersion> — each AQVersion has its own Guid Id (the reference target) plus an int VersionNumber for ordering.
  • ProjectQuestionSet (PQS) — project-wide, owns PQSVersion history.
  • StageQuestionSet (SQS) — per-stage subset of a specific PQSVersion.
  • Cross-aggregate references use typed composite value-object records defined in VersioningValueObjects.cs:
      • AnnotationQuestionVersionReference(Guid QuestionId, Guid VersionId)
      • StageQuestionSetVersionReference(Guid StageId, Guid VersionId)
      • AnnotationVersionReference(Guid AnnotationId, Guid VersionId)
      • AnnotationSessionVersionReference(Guid SessionId, Guid VersionId)

All imported annotations must be pinned to a specific AQVersion via AnnotationQuestionVersionReference, and the owning AnnotationSession pinned to a specific StageQuestionSetVersion via StageQuestionSetVersionReference, before they are written.


User Journey

Step 1 — Setup

Admin navigates to a stage, opens "Import Annotations," and uploads a CSV, JSON, or YAML file. The server detects the format from the file extension and content sniffing, parses the structure synchronously (headers and a few sample rows for CSV; top-level shape for JSON/YAML), and returns immediately with:

  • Detected format
  • Parsed structure (column headers + 3-row sample for CSV; detected field names + sample record for JSON/YAML)
  • Stage question tree (questions as a hierarchy, for use in the mapping step)
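
The format detection described in this step could be sketched as follows. This is a minimal illustration, not the actual implementation: ImportFormat, ImportFormatDetector, and the specific sniffing heuristics are assumptions made for this example. The extension is trusted first, and the leading characters of the content are only inspected when the extension is missing or unrecognized.

```csharp
using System;
using System.IO;

// Hypothetical names: ImportFormat and ImportFormatDetector are illustrative only.
public enum ImportFormat { Csv, Json, Yaml, Unknown }

public static class ImportFormatDetector
{
    // Prefer the file extension; fall back to sniffing the start of the content.
    public static ImportFormat Detect(string fileName, string contentHead)
    {
        string ext = Path.GetExtension(fileName).ToLowerInvariant();
        switch (ext)
        {
            case ".csv": return ImportFormat.Csv;
            case ".json": return ImportFormat.Json;
            case ".yaml":
            case ".yml": return ImportFormat.Yaml;
        }

        string head = contentHead.TrimStart();
        if (head.StartsWith("{") || head.StartsWith("[")) return ImportFormat.Json;
        if (head.StartsWith("---")) return ImportFormat.Yaml;

        // Assumed heuristics: look only at the first line for the remaining cases.
        int newline = head.IndexOf('\n');
        string firstLine = newline >= 0 ? head.Substring(0, newline) : head;
        if (firstLine.Contains(",")) return ImportFormat.Csv;
        if (firstLine.Contains(":")) return ImportFormat.Yaml;
        return ImportFormat.Unknown;
    }
}
```

These heuristics are deliberately crude: a YAML file whose first line contains a comma would be misread as CSV when no extension is available, so a real detector would likely attempt an actual parse before committing to a format.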

Step 2 — Field Mapping

Admin maps three categories of fields:

  • Study identifier — which column/field contains the SyRF study ID or custom ID, and which type it is (SyRFStudyId | CustomId)
  • Annotator identifier — which column/field identifies the annotator, and whether it's an email or investigator ID
  • Questions — maps source columns/fields to SyRF question IDs. The UI displays questions as a tree (not a flat list) to make parent-child relationships visible. Array-type questions are flagged with a semicolon-delimiter hint.

For JSON/YAML, the server attempts auto-mapping by matching JSON field names against question IDs (exact GUID match) and question text (case-insensitive). Matched questions are pre-filled; unmatched ones are presented for manual mapping or can be ignored.
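
A rough sketch of that auto-mapping pass, under stated assumptions: StageQuestion is a simplified stand-in for the real question model, and QuestionAutoMapper is a hypothetical helper name. It tries the exact GUID match first and falls back to a case-insensitive comparison against the question text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a stage question (the real model is AnnotationQuestionV2).
public record StageQuestion(Guid Id, string Text);

public static class QuestionAutoMapper
{
    // Returns sourceField -> question Id for fields that match.
    // Unmatched fields are omitted so the UI can offer them for manual mapping.
    public static Dictionary<string, Guid> Suggest(
        IEnumerable<string> sourceFields, IReadOnlyList<StageQuestion> questions)
    {
        var result = new Dictionary<string, Guid>();
        foreach (string field in sourceFields)
        {
            string key = field.Trim();
            StageQuestion? match = null;

            // 1. Exact GUID match against question IDs.
            if (Guid.TryParse(key, out Guid id))
                match = questions.FirstOrDefault(q => q.Id == id);

            // 2. Case-insensitive match against question text.
            match ??= questions.FirstOrDefault(q =>
                string.Equals(q.Text.Trim(), key, StringComparison.OrdinalIgnoreCase));

            if (match != null) result[field] = match.Id;
        }
        return result;
    }
}
```

If two questions share the same text, this sketch silently picks the first one; a production version would probably leave such fields unmapped and let the admin choose.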

Step 3 — Async Validation

On mapping submission the server enqueues a validation job (MassTransit consumer). Admin sees a "validating…" state. The job:

  1. Reads the file from GCS
  2. Parses all rows/records using the stored mapping
  3. Resolves study identifiers against project studies
  4. Resolves annotator identifiers against project members
  5. Reconstructs the annotation tree per (study, annotator) group — detecting structural issues
  6. Produces an AnnotationValidationResult (metadata only; no parsed annotation objects are stored)

Step 4 — Conflict Resolution

Admin reviews a structured report in four categories:

| Category | Description | Admin action |
|---|---|---|
| Annotation conflicts | Annotations for that annotator+study already exist in the target stage | Overwrite or Skip per group |
| Conditional warnings | Child question has a value but parent is blank or answered negatively | Acknowledge or exclude per row |
| Orphaned children | Child question answered, parent question absent entirely in the import data | Promote to root or Skip per item |
| Unmatched / unresolved | Study ID not found in project, or annotator not a project member | Auto-excluded, informational only |

Step 5 — Commit

Admin submits resolutions. The server:

  1. Re-reads the file from GCS using the stored mapping
  2. Applies resolutions (skip excluded groups/items)
  3. Resolves the target StageQuestionSetVersion (the stage's current published SQS version) and captures its StageQuestionSetVersionReference(StageId, VersionId)
  4. For each question mapping, resolves the current published AQVersion.Id on the mapped AnnotationQuestionV2 and stores its AnnotationQuestionVersionReference(QuestionId, VersionId)
  5. Writes annotations depth-first — parents created first, IDs captured, children linked via ParentId/Children — each Annotation/AnnotationVersion pair carrying the AnnotationQuestionVersionReference captured above
  6. For each (study, annotator) group, creates an AnnotationSession with the SQS version reference, plus one AnnotationSessionVersion (ASV) pinning each new Annotation to its first AnnotationVersion.Id
  7. Marks job complete, deletes GCS file

Authorization: Project administrator only.


Data Model

New collection: pmAnnotationImportJob

A new AggregateRoot<Guid> entity in SyRF.ProjectManagement.Core.Model.AnnotationImportJobAggregate, following the existing collection pattern (same as DataExportJob).

Status lifecycle

Created → MappingPending → Validating → ValidationComplete → Committing → Completed
                                                                         → Failed
                                                                         → Cancelled
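
One way to guard this lifecycle in code is a small transition check. The sketch below is an assumption-laden illustration: the diagram does not state exactly which states may move to Failed or Cancelled, so it assumes any non-terminal state can, and AnnotationImportStatusRules is a hypothetical helper name.

```csharp
using System;

public enum AnnotationImportStatus
{
    // Happy-path states are declared in lifecycle order; CanTransition relies on that order.
    Created, MappingPending, Validating, ValidationComplete, Committing, Completed,
    Failed, Cancelled
}

public static class AnnotationImportStatusRules
{
    public static bool IsTerminal(AnnotationImportStatus s) =>
        s == AnnotationImportStatus.Completed ||
        s == AnnotationImportStatus.Failed ||
        s == AnnotationImportStatus.Cancelled;

    public static bool CanTransition(AnnotationImportStatus from, AnnotationImportStatus to)
    {
        // Terminal states never transition.
        if (IsTerminal(from)) return false;

        // Assumption: any non-terminal job may fail or be cancelled.
        if (to == AnnotationImportStatus.Failed || to == AnnotationImportStatus.Cancelled)
            return true;

        // Otherwise only the next happy-path state is allowed.
        return to <= AnnotationImportStatus.Completed && (int)to == (int)from + 1;
    }
}
```

Under these assumptions a job cannot return from ValidationComplete to MappingPending; if re-mapping after validation is wanted, the rule set would need an explicit backward edge.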

Entity fields

public class AnnotationImportJob : AggregateRoot<Guid>
{
    public Guid ProjectId { get; set; }
    public Guid StageId { get; set; }
    public Guid CreatedByInvestigatorId { get; set; }
    public DateTime CreatedAt { get; set; }
    public DateTime? CompletedAt { get; set; }

    public ImportFileInfo File { get; set; }
    // { FileName, GcsObjectName, Format (CSV|JSON|YAML), SizeBytes }

    public ParsedStructure? ParsedStructure { get; set; }
    // CSV: { Headers: string[], SampleRows: string[][] }
    // JSON/YAML: { DetectedFields: string[], SampleRecord: object }

    public AnnotationImportMapping? Mapping { get; set; }
    public AnnotationValidationResult? ValidationResult { get; set; }
    public AnnotationImportResolutions? Resolutions { get; set; }
    public ImportStats? Stats { get; set; }
    // { AnnotationsImported, StudiesAffected, AnnotationsSkipped }

    public AnnotationImportStatus Status { get; set; }
    public string? ErrorMessage { get; set; }
}

Mapping (stored after Step 2)

public record AnnotationImportMapping(
    string StudyIdentifierField,
    StudyIdentifierType StudyIdentifierType,       // SyRFStudyId | CustomId
    string AnnotatorIdentifierField,
    AnnotatorIdentifierType AnnotatorIdentifierType, // Email | InvestigatorId
    IReadOnlyList<QuestionFieldMapping> QuestionMappings
);

public record QuestionFieldMapping(
    string SourceField,       // CSV column name or JSON key
    Guid QuestionId,          // SyRF question ID
    bool IsArrayType,         // Parse cell as semicolon-delimited list (CSV)
    string ArrayDelimiter     // Default ";"
);

Validation result (stored after async validation — metadata only)

The full parsed annotation data is not stored. The file remains in GCS and is re-read at commit time. This keeps the document size bounded for large imports.

public record AnnotationValidationResult(
    IReadOnlyList<CleanGroup> CleanGroups,
    // { StudyId, AnnotatorId, AnnotationCount }

    IReadOnlyList<AnnotationConflict> Conflicts,
    // { StudyId, AnnotatorId, ExistingCount, IncomingCount, Resolution: Pending }

    IReadOnlyList<ConditionalWarning> ConditionalWarnings,
    // { StudyId, AnnotatorId, ChildQuestionId, ParentQuestionId, ParentAnswer }

    IReadOnlyList<OrphanedChildWarning> OrphanedChildren,
    // { StudyId, AnnotatorId, OrphanedQuestionId, Resolution: Pending }

    IReadOnlyList<string> UnmatchedStudyIds,
    IReadOnlyList<string> UnresolvedAnnotators,
    int TotalAnnotationsToImport,
    int TotalStudiesAffected
);

Resolutions (stored after Step 4)

public record AnnotationImportResolutions(
    IReadOnlyList<ConflictResolution> ConflictResolutions,
    // { StudyId, AnnotatorId, Choice: Overwrite | Skip }

    IReadOnlyList<ConditionalWarningResolution> ConditionalWarningResolutions,
    // { StudyId, AnnotatorId, ChildQuestionId, Include: bool }

    IReadOnlyList<OrphanedChildResolution> OrphanedChildResolutions
    // { StudyId, AnnotatorId, OrphanedQuestionId, Choice: PromoteToRoot | Skip }
);

API

Five endpoints on the project-management service:

POST   /api/projects/{projectId}/stages/{stageId}/annotation-imports
       Multipart upload. Synchronously parses structure.
       Returns: { jobId, format, parsedStructure, stageQuestions (as tree) }

POST   /api/projects/{projectId}/annotation-imports/{jobId}/mapping
       Submits field mapping. Triggers async validation.
       Returns: 202 Accepted

GET    /api/projects/{projectId}/annotation-imports/{jobId}
       Polls for status + validation result.
       Returns: { status, validationResult?, errorMessage? }

POST   /api/projects/{projectId}/annotation-imports/{jobId}/confirm
       Submits resolutions. Synchronously commits annotations.
       Returns: { stats }

DELETE /api/projects/{projectId}/annotation-imports/{jobId}
       Cancels job, deletes GCS file.

Backend Architecture

Format parsing

An IAnnotationFileParser interface with three implementations, all producing a unified internal representation:

IAnnotationFileParser
  → CsvAnnotationParser    reads headers + column-mapped rows
  → JsonAnnotationParser   reads fields + nested annotation records
  → YamlAnnotationParser   converts YAML→JSON, delegates to JsonAnnotationParser

All parsers produce IEnumerable<ImportRecord>:

record ImportRecord(
    string StudyIdentifierRaw,
    string AnnotatorIdentifierRaw,
    IReadOnlyList<ImportAnnotation> Annotations
);

record ImportAnnotation(
    string QuestionKey,          // Source field name (resolved to QuestionId via mapping)
    string RawValue,             // Raw string value from source
    IReadOnlyList<ImportAnnotation> Children  // Nested (JSON/YAML) or empty (CSV)
);
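
To illustrate how a flat CSV row might be turned into this representation, here is a minimal sketch. The two records are repeated from above so the example compiles on its own; CsvRowMapper and its parameters are hypothetical, and the row is assumed to be already split into cells by a CSV reader.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Records repeated from the spec above so the sketch is self-contained.
public record ImportAnnotation(string QuestionKey, string RawValue, IReadOnlyList<ImportAnnotation> Children);
public record ImportRecord(string StudyIdentifierRaw, string AnnotatorIdentifierRaw, IReadOnlyList<ImportAnnotation> Annotations);

public static class CsvRowMapper
{
    // Hypothetical helper: turns one already-split CSV row into an ImportRecord.
    // questionFields lists the mapped source columns; blank cells produce no annotation.
    public static ImportRecord ToImportRecord(
        IReadOnlyList<string> headers, IReadOnlyList<string> cells,
        string studyField, string annotatorField, IReadOnlyCollection<string> questionFields)
    {
        var byHeader = new Dictionary<string, string>();
        for (int i = 0; i < headers.Count && i < cells.Count; i++)
            byHeader[headers[i]] = cells[i];

        // Missing identifier columns yield an empty string, to be reported by validation.
        string Cell(string field) => byHeader.TryGetValue(field, out var v) ? v : "";

        var annotations = questionFields
            .Where(f => byHeader.TryGetValue(f, out var v) && !string.IsNullOrWhiteSpace(v))
            // CSV has no nesting, so Children is always empty here.
            .Select(f => new ImportAnnotation(f, byHeader[f], Array.Empty<ImportAnnotation>()))
            .ToList();

        return new ImportRecord(Cell(studyField), Cell(annotatorField), annotations);
    }
}
```

Parent-child links are not set at this point for CSV input; per the design they are inferred later from the stage question tree.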

Async validation — MassTransit consumer

Following the same pattern as CsvDataExportWorkerConsumer:

POST /mapping
  → stores mapping (Status: Validating)
  → publishes ValidateAnnotationImport { JobId }
  → returns 202

AnnotationImportValidationConsumer
  → loads job
  → reads file from GCS
  → parses via IAnnotationFileParser
  → resolves study IDs and annotators
  → validates tree structure per group (orphan + conditional detection)
  → writes ValidationResult to job (Status: ValidationComplete)

Tree reconstruction — commit time

Shared for all formats. Runs depth-first:

AnnotationTreeBuilder.Build(importRecord, stageQuestionTree, mapping, resolutions)
  → for each root question in stage question tree that has a mapping:
      create Annotation { Id = Guid.NewGuid(), Root = true, ParentId = null }
      for each child question (depth-first):
          if value present and not excluded by resolution:
              create Annotation { Id = Guid.NewGuid(), Root = false, ParentId = parent.Id }
              append Id to parent.Children
  → return flat List<Annotation> with tree links intact
  → pass to study.AddSessionData() (existing write path)
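
The traversal above could look roughly like this in C#. It is a simplified sketch rather than the real AnnotationTreeBuilder: QuestionNode and AnnotationSketch are hypothetical stand-ins for the stage question tree and the Annotation model, the input is assumed to be a questionId-to-value map with resolutions already applied, and promotion of orphans to root is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-ins for the real question tree and Annotation model.
public class QuestionNode
{
    public Guid QuestionId { get; init; }
    public List<QuestionNode> Children { get; } = new();
}

public class AnnotationSketch
{
    public Guid Id { get; } = Guid.NewGuid();
    public Guid QuestionId { get; init; }
    public bool Root { get; init; }
    public Guid? ParentId { get; init; }
    public List<Guid> Children { get; } = new();
    public string Value { get; init; } = "";
}

public static class AnnotationTreeSketch
{
    // values: questionId -> raw value for one (study, annotator) group.
    public static List<AnnotationSketch> Build(
        IEnumerable<QuestionNode> roots, IReadOnlyDictionary<Guid, string> values)
    {
        var flat = new List<AnnotationSketch>();
        foreach (var root in roots) Visit(root, null, values, flat);
        return flat;
    }

    private static void Visit(QuestionNode q, AnnotationSketch? parent,
        IReadOnlyDictionary<Guid, string> values, List<AnnotationSketch> flat)
    {
        // No value for this question: skip it and its whole subtree.
        if (!values.TryGetValue(q.QuestionId, out var value)) return;

        var a = new AnnotationSketch
        {
            QuestionId = q.QuestionId, Root = parent == null, ParentId = parent?.Id, Value = value
        };
        parent?.Children.Add(a.Id);   // link parent -> child once the child Id exists
        flat.Add(a);                  // a parent is always added before its children
        foreach (var child in q.Children) Visit(child, a, values, flat);
    }
}
```

Because each annotation gets its Id at construction, the parent's Id is known by the time its children are visited, which is what lets a single depth-first pass produce a flat list with both ParentId and Children populated.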

Tree validation — validation time

Detects issues using question IDs (not annotation IDs, which don't exist yet):

  • Conditional warning: child question key present in import record AND parent question key is blank or parent answer is the "negative" value for that question type (e.g., false for boolean, empty for string)
  • Orphaned child: child question key present AND parent question key is entirely absent from the import record
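
The two checks can be expressed as one small function over question keys. The following is a sketch under assumptions: TreeIssue and TreeValidationSketch are hypothetical names, a key missing from the dictionary is taken to mean the column or field was absent from the import, and "negative" is approximated as blank or a parseable boolean false (the real rule depends on the question type).

```csharp
using System;
using System.Collections.Generic;

public enum TreeIssue { None, ConditionalWarning, OrphanedChild }

public static class TreeValidationSketch
{
    // record: question key -> raw value for one (study, annotator) group.
    public static TreeIssue Check(
        IReadOnlyDictionary<string, string> record, string childKey, string parentKey)
    {
        // Child not answered: nothing to check.
        if (!record.TryGetValue(childKey, out var child) || string.IsNullOrWhiteSpace(child))
            return TreeIssue.None;

        // Parent entirely absent from the import data: orphaned child.
        if (!record.TryGetValue(parentKey, out var parent))
            return TreeIssue.OrphanedChild;

        // Parent present but blank or "negative": conditional warning.
        return IsBlankOrNegative(parent) ? TreeIssue.ConditionalWarning : TreeIssue.None;
    }

    // Assumption: only blank values and boolean false count as negative in this sketch.
    private static bool IsBlankOrNegative(string raw) =>
        string.IsNullOrWhiteSpace(raw) ||
        (bool.TryParse(raw.Trim(), out bool b) && !b);
}
```

Values such as "No" or "0" are treated as answered here; handling them would require knowing the mapped question's DataType.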

File storage — GCS

  • Service: IGcsStorageService in SyRF.AppServices — wraps Google.Cloud.Storage.V1
  • Auth: Workload Identity (pod service account — no credentials to manage)
  • Bucket: syrf-annotation-imports (configurable via GCS:AnnotationImportsBucket)
  • Lifecycle: Upload on POST → read on validation → read + delete on commit/cancel/failure
  • Note: Establishes GCS as the standard for new features. Existing S3 usage (study import Lambda pipeline) is unchanged and will be migrated separately.

Tree Handling: Key Considerations

The annotation tree is the primary complexity of this feature. The Annotation model uses ParentId (pointing to the parent annotation ID, not question ID) and Children (list of child annotation IDs). These IDs don't exist until commit time, making tree reconstruction a two-pass problem.

CSV-specific tree challenges

  • Flat representation: Wide format with one column per question. Parent-child relationships are inferred from the stage question tree + column mapping.
  • Array-type questions: BoolArrayAnnotation, StringArrayAnnotation, etc. are encoded as semicolon-separated values in a single cell (e.g., "Option A;Option C"). The column mapping step flags these and specifies the delimiter.
  • Conditional logic: If a parent question is answered negatively, child columns should be empty. A non-empty child column when parent is negative triggers a conditional warning.
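
As an illustration of the array-cell handling, a delimited cell can be split with the per-column delimiter from the mapping. ArrayCellParser is a hypothetical helper name, and this sketch assumes items never contain the delimiter itself (no escaping or quoting is handled).

```csharp
using System;
using System.Linq;

public static class ArrayCellParser
{
    // Splits a delimited CSV cell into trimmed, non-empty items.
    public static string[] Split(string cell, string delimiter = ";")
    {
        if (string.IsNullOrWhiteSpace(cell)) return Array.Empty<string>();
        return cell
            .Split(new[] { delimiter }, StringSplitOptions.RemoveEmptyEntries)
            .Select(item => item.Trim())
            .Where(item => item.Length > 0)
            .ToArray();
    }
}
```

For the example cell above, ArrayCellParser.Split("Option A;Option C") yields the two items "Option A" and "Option C".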

JSON/YAML-specific tree handling

  • Nesting is native: The tree structure is explicit in the document. No inference needed.
  • Auto-mapping: JSON field names are matched against question IDs (exact GUID) and question text (case-insensitive) to pre-fill mappings.
  • YAML quirks: YAML type coercion (e.g., Yes parsed as boolean) must be handled in the parser before processing.

Depth

The algorithm handles arbitrary tree depth via recursive depth-first traversal. No hardcoded depth limit.


Import File Schema

JSON / YAML (canonical — supports full nesting)

# annotation-import.yaml
version: "1.0"
targetStageId: "<GUID>"                      # stage to import into
targetSqsVersionId: "<GUID>"                 # StageQuestionSetVersion this import is pinned to
annotatorIdField: "annotator"                # source field carrying annotator identifier
annotatorIdType: "email"                     # email | investigatorId
studyIdField: "studyId"                      # source field carrying study identifier
studyIdType: "syrfStudyId"                   # syrfStudyId | customId
questionMapping:                             # map source question refs → target AnnotationQuestionV2 IDs
  "source-q1": "<target-AnnotationQuestionV2-GUID>"
  "source-q1.1": "<target-AnnotationQuestionV2-GUID>"   # refs used under children need a mapping too
  "source-q2": "<target-AnnotationQuestionV2-GUID>"
annotations:
  - studyId: "<GUID-or-custom-id>"
    annotator: "researcher@example.org"
    answers:
      - questionRef: "source-q1"             # uses source ID, resolved via mapping
        answerType: "string"
        value: "Mouse"
        children:                             # nested child annotations (JSON/YAML native)
          - questionRef: "source-q1.1"
            answerType: "stringArray"
            value: ["Strain A", "Strain B"]
      - questionRef: "source-q2"
        answerType: "bool"
        value: true

The version field is the schema version of the import file format itself, independent of any QuestionSet version. answerType must match the target question's DataType — the backend rejects mismatches at validation time.

CSV (flat — subject to nesting constraints)

CSV is supported for projects where the question tree is flat, or where parent-child relationships are expressible via column ordering and the stage question tree (not via CSV nesting, which doesn't exist). One row per (study, annotator) combination. Column headers are the source question refs (mapped to target AnnotationQuestionV2.Id values in Step 2). Array answers use a per-column delimiter specified at mapping time (default ;).

For deeply nested question trees, JSON or YAML is the recommended format; see brief.md Key Decision 4.


Validation Rules

Applied in Step 3 (async validation). Rules are classified by severity — errors block commit, warnings flag for admin attention.

| Rule | Severity |
|---|---|
| Target stage exists on project | Error |
| Target StageQuestionSetVersion exists on stage and is published | Error |
| All mapped questions exist in the referenced SQS version | Error |
| Annotator is a member of the project | Error |
| Study ID exists in the project | Error |
| answerType in record matches the mapped question's DataType | Error |
| No duplicate annotation for same (study, annotator, question, stage) within the import file | Error |
| Question mapping covers all questionRef values appearing in the file | Warning |
| Answer value is within the option set (for dropdown / checklist / radio questions) | Warning |
| Parent question is answered when a child question is answered | Warning (conditional warning) |
| Child question's parent exists in the import (not orphaned) | Warning (orphan warning, resolvable) |
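
The severity split can be reduced to a simple gate at commit time. This is only a sketch: FindingSeverity, ValidationFinding, and CommitGate are hypothetical names, and the spec's AnnotationValidationResult does not currently define such a findings list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum FindingSeverity { Warning, Error }

// Hypothetical record: one rule outcome produced by the validation job.
public record ValidationFinding(string Rule, FindingSeverity Severity);

public static class CommitGate
{
    // Errors block commit; warnings only need admin attention.
    public static bool CanCommit(IEnumerable<ValidationFinding> findings) =>
        !findings.Any(f => f.Severity == FindingSeverity.Error);
}
```

A result containing only warnings (for example an out-of-option-set answer) would still be committable, while a single error such as an answerType mismatch would block the whole job under this reading of the rules.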

Roadmap Context

Phase 6.5 sits between:

  • Phase 6: Question Management UI (AQ versioning UI, version badges, admin decision framework)
  • Phase 7: Release 1 Data Migration (published question backfill, SQS backfill, feature flags)

This placement means:

  • The QM v2 domain is available — imports map to AnnotationQuestionV2.Id values and pin versions via AnnotationQuestionVersionReference
  • All annotation UI is complete — imported annotations are immediately visible in the form
  • Import + migration (Phase 7) work well together — bring external data in, then run R1 migration to backfill versioning metadata for legacy stages

The feature is self-contained and does not block Phase 7.