Annotation Import Feature Design
Overview
Allow project administrators to import annotations from external tools into an existing SyRF project stage. The primary use case is migration from another systematic review tool (e.g., Rayyan, Covidence, custom spreadsheets) where researchers have already coded data and want to bring it into SyRF.
Roadmap placement: Phase 6.5 — inserted between Phase 6 (Question Management UI) and Phase 7 (Release 1 Data Migration). Targets the AnnotationQuestionV2 model from the QM v2 work landing on PRs #2572–#2575.
Implementation alignment: This spec predates the final QM v2 domain model. Where this document refers to "AQ", "AQVersion", "QSV", and their reference types, the authoritative current types are:
- AnnotationQuestionV2 (class) with its embedded List<AQVersion>: each AQVersion has its own Guid Id (the reference target) plus an int VersionNumber for ordering.
- ProjectQuestionSet (PQS): project-wide, owns PQSVersion history.
- StageQuestionSet (SQS): per-stage subset of a specific PQSVersion.
- Cross-aggregate references use typed composite value-object records defined in VersioningValueObjects.cs:
  - AnnotationQuestionVersionReference(Guid QuestionId, Guid VersionId)
  - StageQuestionSetVersionReference(Guid StageId, Guid VersionId)
  - AnnotationVersionReference(Guid AnnotationId, Guid VersionId)
  - AnnotationSessionVersionReference(Guid SessionId, Guid VersionId)
All imported annotations must be pinned to a specific AQVersion via AnnotationQuestionVersionReference, and the owning AnnotationSession pinned to a specific StageQuestionSetVersion via StageQuestionSetVersionReference, before they are written.
User Journey
Step 1: Setup
Admin navigates to a stage and opens "Import Annotations." Uploads a CSV, JSON, or YAML file. The server detects the format from file extension and content sniffing, parses structure synchronously (headers only for CSV; top-level shape for JSON/YAML), and returns immediately with:
- Detected format
- Parsed structure (column headers + 3-row sample for CSV; detected field names + sample record for JSON/YAML)
- Stage question tree (questions as a hierarchy, for use in the mapping step)
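The format-detection step could look roughly like the sketch below. It assumes the extension is trusted when recognised and that content sniffing is only a fallback; the spec does not say how the two signals are combined. `ImportFormat` and `ImportFormatDetector` are hypothetical names, and the sniffing heuristic is deliberately crude.

```csharp
using System;
using System.IO;

// Hypothetical names for illustration; not existing SyRF types.
public enum ImportFormat { Csv, Json, Yaml }

public static class ImportFormatDetector
{
    // Assumption: a recognised extension wins; otherwise sniff the first characters.
    public static ImportFormat Detect(string fileName, string contentPrefix)
    {
        switch (Path.GetExtension(fileName).ToLowerInvariant())
        {
            case ".csv": return ImportFormat.Csv;
            case ".json": return ImportFormat.Json;
            case ".yaml":
            case ".yml": return ImportFormat.Yaml;
        }

        var trimmed = contentPrefix.TrimStart();

        // JSON documents start with an object or array delimiter.
        if (trimmed.StartsWith("{", StringComparison.Ordinal) ||
            trimmed.StartsWith("[", StringComparison.Ordinal))
        {
            return ImportFormat.Json;
        }

        // Crude heuristic: a "key: value" first line without commas is treated as YAML,
        // anything else as CSV. Real sniffing would need to be more careful.
        var firstLine = trimmed.Split('\n')[0];
        if (firstLine.Contains(": ") && !firstLine.Contains(","))
        {
            return ImportFormat.Yaml;
        }
        return ImportFormat.Csv;
    }
}
```

For example, `ImportFormatDetector.Detect("export.yml", "")` yields `ImportFormat.Yaml` from the extension alone, while a file without an extension falls through to the content check.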
Step 2: Field Mapping
Admin maps three categories of fields:
- Study identifier: which column/field contains the SyRF study ID or custom ID, and which type it is (SyRFStudyId | CustomId)
- Annotator identifier: which column/field identifies the annotator, and whether it's an email or investigator ID
- Questions: maps source columns/fields to SyRF question IDs. The UI displays questions as a tree (not a flat list) to make parent-child relationships visible. Array-type questions are flagged with a semicolon-delimiter hint.
For JSON/YAML, the server attempts auto-mapping by matching JSON field names against question IDs (exact GUID match) and question text (case-insensitive). Matched questions are pre-filled; unmatched ones are presented for manual mapping or can be ignored.
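A minimal sketch of the auto-mapping pass described above, assuming a simplified `StageQuestion` shape (hypothetical) and that the first question with matching text wins when several share the same wording:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified question shape for illustration only.
public record StageQuestion(Guid Id, string Text);

public static class QuestionAutoMapper
{
    // Returns source field -> question ID for fields that match.
    // Unmatched fields are omitted so the UI can offer manual mapping.
    public static Dictionary<string, Guid> AutoMap(
        IEnumerable<string> detectedFields,
        IReadOnlyList<StageQuestion> questions)
    {
        var byId = questions.ToDictionary(q => q.Id);
        var result = new Dictionary<string, Guid>();

        foreach (var field in detectedFields)
        {
            // 1. Field name is itself a question GUID.
            if (Guid.TryParse(field, out var id) && byId.ContainsKey(id))
            {
                result[field] = id;
                continue;
            }

            // 2. Field name equals the question text, ignoring case.
            var byText = questions.FirstOrDefault(q =>
                string.Equals(q.Text, field.Trim(), StringComparison.OrdinalIgnoreCase));
            if (byText != null)
            {
                result[field] = byText.Id;
            }
        }
        return result;
    }
}
```

Trying the GUID match before the text match keeps the unambiguous case cheap and avoids a text collision overriding an explicit ID.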
Step 3: Async Validation
On mapping submission the server enqueues a validation job (MassTransit consumer). Admin sees a "validating…" state. The job:
- Reads the file from GCS
- Parses all rows/records using the stored mapping
- Resolves study identifiers against project studies
- Resolves annotator identifiers against project members
- Reconstructs the annotation tree per (study, annotator) group — detecting structural issues
- Produces an AnnotationValidationResult (metadata only; no parsed annotation objects stored)
Step 4: Conflict Resolution
Admin reviews a structured report in four categories:
| Category | Description | Admin action |
|---|---|---|
| Annotation conflicts | Annotations for that annotator and study already exist in the target stage | Overwrite or Skip per group |
| Conditional warnings | Child question has a value but parent is blank or answered negatively | Acknowledge or exclude per row |
| Orphaned children | Child question answered, parent question absent entirely in the import data | Promote to root or Skip per item |
| Unmatched / unresolved | Study ID not found in project, or annotator not a project member | Auto-excluded, informational only |
Step 5: Commit
Admin submits resolutions. The server:
- Re-reads the file from GCS using the stored mapping
- Applies resolutions (skip excluded groups/items)
- Resolves the target StageQuestionSetVersion (the stage's current published SQS version) and captures its StageQuestionSetVersionReference(StageId, VersionId)
- For each question mapping, resolves the current published AQVersion.Id on the mapped AnnotationQuestionV2 and stores its AnnotationQuestionVersionReference(QuestionId, VersionId)
- Writes annotations depth-first: parents created first, IDs captured, children linked via ParentId/Children, with each Annotation/AnnotationVersion pair carrying the AnnotationQuestionVersionReference captured above
- For each (study, annotator) group, creates an AnnotationSession with the SQS version reference, plus one AnnotationSessionVersion (ASV) pinning each new Annotation to its first AnnotationVersion.Id
- Marks job complete, deletes GCS file
Authorization: Project administrator only.
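The version-pinning part of the commit could be sketched as below. The record shapes are simplified stand-ins for the real types, `IsPublished` is an assumed flag, and "current published" is assumed to mean the published version with the highest VersionNumber; none of this is confirmed by the QM v2 model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-ins; IsPublished is an assumption, not a known AQVersion property.
public record AQVersion(Guid Id, int VersionNumber, bool IsPublished);
public record AnnotationQuestionVersionReference(Guid QuestionId, Guid VersionId);

public static class VersionPinning
{
    // Picks the published version with the highest VersionNumber and
    // returns the composite reference an imported annotation would carry.
    public static AnnotationQuestionVersionReference PinToCurrentPublished(
        Guid questionId, IEnumerable<AQVersion> versions)
    {
        var current = versions
            .Where(v => v.IsPublished)
            .OrderByDescending(v => v.VersionNumber)
            .FirstOrDefault()
            ?? throw new InvalidOperationException(
                $"Question {questionId} has no published version to pin to.");

        return new AnnotationQuestionVersionReference(questionId, current.Id);
    }
}
```

Failing fast when no published version exists keeps unpinned annotations from ever reaching the write path.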
Data Model
New collection: pmAnnotationImportJob
A new AggregateRoot<Guid> entity in SyRF.ProjectManagement.Core.Model.AnnotationImportJobAggregate, following the existing collection pattern (same as DataExportJob).
Status lifecycle
Created → MappingPending → Validating → ValidationComplete → Committing → Completed
→ Failed
→ Cancelled
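A guard for these transitions might look like the following sketch. It assumes Failed and Cancelled are reachable from every non-terminal state, which the diagram above does not spell out, and `AnnotationImportStatusTransitions` is a hypothetical helper name.

```csharp
using System;

public enum AnnotationImportStatus
{
    Created,
    MappingPending,
    Validating,
    ValidationComplete,
    Committing,
    Completed,
    Failed,
    Cancelled
}

// Hypothetical helper; the real aggregate may enforce transitions differently.
public static class AnnotationImportStatusTransitions
{
    public static bool IsTerminal(AnnotationImportStatus status) =>
        status is AnnotationImportStatus.Completed
            or AnnotationImportStatus.Failed
            or AnnotationImportStatus.Cancelled;

    public static bool CanTransition(AnnotationImportStatus current, AnnotationImportStatus next)
    {
        if (IsTerminal(current)) return false;

        // Assumption: any in-flight job can fail or be cancelled.
        if (next is AnnotationImportStatus.Failed or AnnotationImportStatus.Cancelled) return true;

        // Otherwise only the next step on the happy path is allowed.
        return next == current + 1;
    }
}
```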
Entity fields
public class AnnotationImportJob : AggregateRoot<Guid>
{
    public Guid ProjectId { get; set; }
    public Guid StageId { get; set; }
    public Guid CreatedByInvestigatorId { get; set; }
    public DateTime CreatedAt { get; set; }
    public DateTime? CompletedAt { get; set; }

    public ImportFileInfo File { get; set; }
    // { FileName, GcsObjectName, Format (CSV|JSON|YAML), SizeBytes }

    public ParsedStructure? ParsedStructure { get; set; }
    // CSV: { Headers: string[], SampleRows: string[][] }
    // JSON/YAML: { DetectedFields: string[], SampleRecord: object }

    public AnnotationImportMapping? Mapping { get; set; }
    public AnnotationValidationResult? ValidationResult { get; set; }
    public AnnotationImportResolutions? Resolutions { get; set; }

    public ImportStats? Stats { get; set; }
    // { AnnotationsImported, StudiesAffected, AnnotationsSkipped }

    public AnnotationImportStatus Status { get; set; }
    public string? ErrorMessage { get; set; }
}
Mapping (stored after Step 2)
public record AnnotationImportMapping(
    string StudyIdentifierField,
    StudyIdentifierType StudyIdentifierType,          // SyRFStudyId | CustomId
    string AnnotatorIdentifierField,
    AnnotatorIdentifierType AnnotatorIdentifierType,  // Email | InvestigatorId
    IReadOnlyList<QuestionFieldMapping> QuestionMappings
);

public record QuestionFieldMapping(
    string SourceField,     // CSV column name or JSON key
    Guid QuestionId,        // SyRF question ID
    bool IsArrayType,       // Parse cell as semicolon-delimited list (CSV)
    string ArrayDelimiter   // Default ";"
);
Validation result (stored after async validation, metadata only)
The full parsed annotation data is not stored. The file remains in GCS and is re-read at commit time. This keeps the document size bounded for large imports.
public record AnnotationValidationResult(
    IReadOnlyList<CleanGroup> CleanGroups,
    // { StudyId, AnnotatorId, AnnotationCount }
    IReadOnlyList<AnnotationConflict> Conflicts,
    // { StudyId, AnnotatorId, ExistingCount, IncomingCount, Resolution: Pending }
    IReadOnlyList<ConditionalWarning> ConditionalWarnings,
    // { StudyId, AnnotatorId, ChildQuestionId, ParentQuestionId, ParentAnswer }
    IReadOnlyList<OrphanedChildWarning> OrphanedChildren,
    // { StudyId, AnnotatorId, OrphanedQuestionId, Resolution: Pending }
    IReadOnlyList<string> UnmatchedStudyIds,
    IReadOnlyList<string> UnresolvedAnnotators,
    int TotalAnnotationsToImport,
    int TotalStudiesAffected
);
Resolutions (stored after Step 4)
public record AnnotationImportResolutions(
    IReadOnlyList<ConflictResolution> ConflictResolutions,
    // { StudyId, AnnotatorId, Choice: Overwrite | Skip }
    IReadOnlyList<ConditionalWarningResolution> ConditionalWarningResolutions,
    // { StudyId, AnnotatorId, ChildQuestionId, Include: bool }
    IReadOnlyList<OrphanedChildResolution> OrphanedChildResolutions
    // { StudyId, AnnotatorId, OrphanedQuestionId, Choice: PromoteToRoot | Skip }
);
API
Five endpoints on the project-management service:
POST /api/projects/{projectId}/stages/{stageId}/annotation-imports
    Multipart upload. Synchronously parses structure.
    Returns: { jobId, format, parsedStructure, stageQuestions (as tree) }

POST /api/projects/{projectId}/annotation-imports/{jobId}/mapping
    Submits field mapping. Triggers async validation.
    Returns: 202 Accepted

GET /api/projects/{projectId}/annotation-imports/{jobId}
    Polls for status + validation result.
    Returns: { status, validationResult?, errorMessage? }

POST /api/projects/{projectId}/annotation-imports/{jobId}/confirm
    Submits resolutions. Synchronously commits annotations.
    Returns: { stats }

DELETE /api/projects/{projectId}/annotation-imports/{jobId}
    Cancels job, deletes GCS file.
Backend Architecture
Format parsing
An IAnnotationFileParser interface with three implementations, all producing a unified internal representation:
IAnnotationFileParser
  → CsvAnnotationParser    reads headers + column-mapped rows
  → JsonAnnotationParser   reads fields + nested annotation records
  → YamlAnnotationParser   converts YAML→JSON, delegates to JsonAnnotationParser
All parsers produce IEnumerable<ImportRecord>:
record ImportRecord(
    string StudyIdentifierRaw,
    string AnnotatorIdentifierRaw,
    IReadOnlyList<ImportAnnotation> Annotations
);

record ImportAnnotation(
    string QuestionKey,                       // Source field name (resolved to QuestionId via mapping)
    string RawValue,                          // Raw string value from source
    IReadOnlyList<ImportAnnotation> Children  // Nested (JSON/YAML) or empty (CSV)
);
Async validation: MassTransit consumer
Following the same pattern as CsvDataExportWorkerConsumer:
POST /mapping
  → stores mapping (Status: Validating)
  → publishes ValidateAnnotationImport { JobId }
  → returns 202

AnnotationImportValidationConsumer
  → loads job
  → reads file from GCS
  → parses via IAnnotationFileParser
  → resolves study IDs and annotators
  → validates tree structure per group (orphan + conditional detection)
  → writes ValidationResult to job (Status: ValidationComplete)
Tree reconstruction: commit time
Shared for all formats. Runs depth-first:
AnnotationTreeBuilder.Build(importRecord, stageQuestionTree, mapping, resolutions)
  → for each root question in stage question tree that has a mapping:
      create Annotation { Id = Guid.NewGuid(), Root = true, ParentId = null }
      for each child question (depth-first):
        if value present and not excluded by resolution:
          create Annotation { Id = Guid.NewGuid(), Root = false, ParentId = parent.Id }
          append Id to parent.Children
  → return flat List<Annotation> with tree links intact
  → pass to study.AddSessionData() (existing write path)
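The depth-first build above can be sketched in C# as follows. `QuestionNode` and `AnnotationSketch` are simplified stand-ins for the real question tree and Annotation model. This sketch skips a subtree when its parent has no value and does not implement the "promote to root" resolution, so it illustrates the traversal only.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for the real SyRF types (hypothetical shapes).
public record QuestionNode(Guid QuestionId, IReadOnlyList<QuestionNode> Children);

public class AnnotationSketch
{
    public Guid Id { get; init; } = Guid.NewGuid();
    public Guid QuestionId { get; init; }
    public string Value { get; init; } = "";
    public bool Root { get; init; }
    public Guid? ParentId { get; init; }
    public List<Guid> Children { get; } = new();
}

public static class AnnotationTreeBuilderSketch
{
    // valuesByQuestion holds the answers of one (study, annotator) group, keyed by question ID.
    public static List<AnnotationSketch> Build(
        IEnumerable<QuestionNode> rootQuestions,
        IReadOnlyDictionary<Guid, string> valuesByQuestion,
        ISet<Guid> excludedQuestionIds)
    {
        var result = new List<AnnotationSketch>();
        foreach (var root in rootQuestions)
        {
            Visit(root, null, valuesByQuestion, excludedQuestionIds, result);
        }
        return result;
    }

    private static void Visit(
        QuestionNode question,
        AnnotationSketch parent,
        IReadOnlyDictionary<Guid, string> values,
        ISet<Guid> excluded,
        List<AnnotationSketch> result)
    {
        // No value, blank value, or excluded by a resolution: create nothing here.
        if (!values.TryGetValue(question.QuestionId, out var value)
            || string.IsNullOrWhiteSpace(value)
            || excluded.Contains(question.QuestionId))
        {
            return;
        }

        // The parent already exists, so its Id can be used for the link.
        var annotation = new AnnotationSketch
        {
            QuestionId = question.QuestionId,
            Value = value,
            Root = parent == null,
            ParentId = parent?.Id
        };
        parent?.Children.Add(annotation.Id);
        result.Add(annotation);

        foreach (var child in question.Children)
        {
            Visit(child, annotation, values, excluded, result);
        }
    }
}
```

Because each parent is created before its children are visited, the child can record `ParentId` immediately and the result stays a flat list with links intact.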
Tree validation: validation time
Detects issues using question IDs (not annotation IDs, which don't exist yet):
- Conditional warning: child question key present in import record AND parent question key is blank or parent answer is the "negative" value for that question type (e.g., false for boolean, empty for string)
- Orphaned child: child question key present AND parent question key is entirely absent from the import record
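The two checks can be told apart as in the sketch below. It assumes a record is available as a dictionary keyed by question ID, that a blank child counts as unanswered, and that only boolean parents have a non-blank "negative" value; `ChildIssue` and `TreeValidationSketch` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public enum ChildIssue { None, ConditionalWarning, OrphanedChild }

public static class TreeValidationSketch
{
    // A missing key means the field was absent from the import record entirely.
    public static ChildIssue Classify(
        IReadOnlyDictionary<Guid, string> record,
        Guid childQuestionId,
        Guid parentQuestionId,
        bool parentIsBoolean)
    {
        // Assumption: a blank child value is treated as "not answered".
        if (!record.TryGetValue(childQuestionId, out var child) || string.IsNullOrWhiteSpace(child))
        {
            return ChildIssue.None;
        }

        // Parent key absent altogether: orphaned child.
        if (!record.TryGetValue(parentQuestionId, out var parent))
        {
            return ChildIssue.OrphanedChild;
        }

        // Parent present but blank, or answered negatively: conditional warning.
        var blank = string.IsNullOrWhiteSpace(parent);
        var negative = parentIsBoolean && bool.TryParse(parent, out var answer) && !answer;
        return blank || negative ? ChildIssue.ConditionalWarning : ChildIssue.None;
    }
}
```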
File storage: GCS
- Service: IGcsStorageService in SyRF.AppServices, wraps Google.Cloud.Storage.V1
- Auth: Workload Identity (pod service account, no credentials to manage)
- Bucket: syrf-annotation-imports (configurable via GCS:AnnotationImportsBucket)
- Lifecycle: Upload on POST → read on validation → read + delete on commit/cancel/failure
- Note: Establishes GCS as the standard for new features. Existing S3 usage (study import Lambda pipeline) is unchanged and migrated separately.
Tree Handling: Key Considerations
The annotation tree is the primary complexity of this feature. The Annotation model uses ParentId (pointing to the parent annotation ID, not question ID) and Children (list of child annotation IDs). These IDs don't exist until commit time, making tree reconstruction a two-pass problem.
CSV-specific tree challenges
- Flat representation: Wide format with one column per question. Parent-child relationships are inferred from the stage question tree + column mapping.
- Array-type questions: BoolArrayAnnotation, StringArrayAnnotation, etc. are encoded as semicolon-separated values in a single cell (e.g., "Option A;Option C"). The column mapping step flags these and specifies the delimiter.
- Conditional logic: If a parent question is answered negatively, child columns should be empty. A non-empty child column when parent is negative triggers a conditional warning.
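Splitting an array-type cell could be as simple as the sketch below. It assumes values never contain the delimiter (no escaping or quoting is handled), and `ArrayCellParser` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ArrayCellParser
{
    // Splits a cell such as "Option A;Option C" into trimmed, non-empty values.
    // Assumption: the delimiter never appears inside a value.
    public static IReadOnlyList<string> Split(string cell, string delimiter = ";")
    {
        if (string.IsNullOrWhiteSpace(cell))
        {
            return Array.Empty<string>();
        }

        return cell
            .Split(delimiter, StringSplitOptions.RemoveEmptyEntries)
            .Select(value => value.Trim())
            .Where(value => value.Length > 0)
            .ToList();
    }
}
```

An empty cell yields an empty list rather than a list with one empty string, which keeps "no answer" distinguishable from "one blank option".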
JSON/YAML-specific tree handling
- Nesting is native: The tree structure is explicit in the document. No inference needed.
- Auto-mapping: JSON field names are matched against question IDs (exact GUID) and question text (case-insensitive) to pre-fill mappings.
- YAML quirks: YAML type coercion (e.g., Yes parsed as boolean) must be handled in the parser before processing.
Depth
The algorithm handles arbitrary tree depth via recursive depth-first traversal. No hardcoded depth limit.
Import File Schema
JSON / YAML (canonical, supports full nesting)
# annotation-import.yaml
version: "1.0"
targetStageId: "<GUID>"            # stage to import into
targetSqsVersionId: "<GUID>"       # StageQuestionSetVersion this import is pinned to
annotatorIdField: "annotator"      # source field carrying annotator identifier
annotatorIdType: "email"           # email | investigatorId
studyIdField: "studyId"            # source field carrying study identifier
studyIdType: "syrfStudyId"         # syrfStudyId | customId
questionMapping:                   # map source question refs → target AnnotationQuestionV2 IDs
  "source-q1": "<target-AnnotationQuestionV2-GUID>"
  "source-q1.1": "<target-AnnotationQuestionV2-GUID>"
  "source-q2": "<target-AnnotationQuestionV2-GUID>"
annotations:
  - studyId: "<GUID-or-custom-id>"
    annotator: "researcher@example.org"
    answers:
      - questionRef: "source-q1"   # uses source ID, resolved via mapping
        answerType: "string"
        value: "Mouse"
        children:                  # nested child annotations (JSON/YAML native)
          - questionRef: "source-q1.1"
            answerType: "stringArray"
            value: ["Strain A", "Strain B"]
      - questionRef: "source-q2"
        answerType: "bool"
        value: true
The version field is the schema version of the import file format itself, independent of any QuestionSet version. answerType must match the target question's DataType — the backend rejects mismatches at validation time.
CSV (flat, subject to nesting constraints)
CSV is supported for projects where the question tree is flat, or where parent-child relationships are expressible via column ordering and the stage question tree (not via CSV nesting, which doesn't exist). One row per (study, annotator) combination. Column headers are the source question refs (mapped to target AnnotationQuestionV2.Id values in Step 2). Array answers use a per-column delimiter specified at mapping time (default ;).
For deeply nested question trees, JSON or YAML is the recommended format; see brief.md Key Decision 4.
Validation Rules
Applied in Step 3 (async validation). Rules are classified by severity — errors block commit, warnings flag for admin attention.
| Rule | Severity |
|---|---|
| Target stage exists on project | Error |
| Target StageQuestionSetVersion exists on stage and is published | Error |
| All mapped questions exist in the referenced SQS version | Error |
| Annotator is a member of the project | Error |
| Study ID exists in the project | Error |
| answerType in record matches the mapped question's DataType | Error |
| No duplicate annotation for same (study, annotator, question, stage) within the import file | Error |
| Question mapping covers all questionRef values appearing in the file | Warning |
| Answer value is within the option set (for dropdown / checklist / radio questions) | Warning |
| Parent question is answered when a child question is answered | Warning (conditional warning) |
| Child question's parent exists in the import (not orphaned) | Warning (orphan warning, resolvable) |
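The severity split reduces to a simple commit gate, sketched below with hypothetical `RuleSeverity`, `ValidationIssue`, and `CommitGate` names: any Error blocks the commit, while Warnings alone do not.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration; the spec stores findings in AnnotationValidationResult.
public enum RuleSeverity { Warning, Error }

public record ValidationIssue(string Rule, RuleSeverity Severity);

public static class CommitGate
{
    // Errors block commit; warnings are surfaced to the admin but do not block.
    public static bool CanCommit(IEnumerable<ValidationIssue> issues) =>
        !issues.Any(issue => issue.Severity == RuleSeverity.Error);
}
```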
Roadmap Context
Phase 6.5 sits between:
- Phase 6: Question Management UI (AQ versioning UI, version badges, admin decision framework)
- Phase 7: Release 1 Data Migration (published question backfill, SQS backfill, feature flags)
This placement means:
- The QM v2 domain is available: imports map to AnnotationQuestionV2.Id values and pin versions via AnnotationQuestionVersionReference
- All annotation UI is complete: imported annotations are immediately visible in the form
- Import + migration (Phase 7) work well together: bring external data in, then run R1 migration to backfill versioning metadata for legacy stages
The feature is self-contained and does not block Phase 7.