Annotation Import

Summary

Phase 06.1 allows project administrators to import annotations from external systematic review tools — such as Rayyan, Covidence, or spreadsheet exports — into an existing SyRF project stage. A guided 5-step wizard walks the administrator through the process: upload a file, map the columns to SyRF questions, wait for automated validation, resolve any conflicts, and commit the import. This is an urgent addition to Release 1 that addresses a common request from teams migrating to SyRF mid-review.

The Problem

Research teams often start a systematic review using one tool and later switch to SyRF for its advanced annotation and reconciliation capabilities. When they switch, they have potentially thousands of annotations sitting in their previous tool that they need to bring into SyRF. Currently, there is no way to do this — teams must either re-do the annotation work in SyRF (wasting months of effort) or maintain two tools in parallel (error-prone and confusing).

Additionally, projects migrating between SyRF instances or merging data from multiple projects need a way to transfer annotations.

This is a significant barrier to SyRF adoption. Teams that have invested heavily in annotation using another tool are reluctant to switch because it means losing or duplicating that work.

User Stories

As a project admin, I want to import annotations from an external file so that my team doesn't have to re-annotate studies that were already reviewed.
As a project admin, I want to map imported annotation fields to my project's questions so that the data aligns correctly.
As a project admin, I want to see a validation report before committing an import so that I can catch errors.
As a project admin, I want to import annotations for specific stages so that I can target the correct review phase.
As a researcher, I want imported annotations to appear alongside manually-created annotations so that data export is unified.

What We're Building

For Project Administrators

A 5-step import wizard accessible from the stage administration area:

flowchart LR
    A["1. Upload<br/>File"] --> B["2. Map<br/>Fields"]
    B --> C["3. Validate<br/>(automatic)"]
    C --> D["4. Resolve<br/>Conflicts"]
    D --> E["5. Commit<br/>Import"]

    style A fill:#e3f2fd
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#e8f5e9

Step 1 — Upload: The administrator uploads a file (JSON, YAML, or CSV format) containing annotation data from the external tool. The system immediately parses the file structure and shows the detected column names and sample data.
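The upload preview can be pictured with a small sketch. The helper name `preview_upload` and the returned keys are hypothetical, and the JSON branch assumes a top-level array of flat objects; the real schema lives in design.md and may differ. YAML is left out because it needs a third-party parser.

```python
import csv
import io
import json


def preview_upload(filename, text, sample_size=3):
    """Return detected column names, a row count, and a few sample rows.

    Hypothetical helper: names and return shape are assumptions for illustration.
    """
    lowered = filename.lower()
    if lowered.endswith(".csv"):
        rows = list(csv.DictReader(io.StringIO(text)))
    elif lowered.endswith(".json"):
        # Assumed layout: a top-level JSON array of flat objects.
        rows = json.loads(text)
    else:
        raise ValueError(f"unsupported format in this sketch: {filename}")

    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return {"columns": columns, "row_count": len(rows), "sample": rows[:sample_size]}


csv_text = (
    "study_id,annotator,species\n"
    "S1,a@example.org,mouse\n"
    "S2,b@example.org,rat\n"
)
preview = preview_upload("export.csv", csv_text, sample_size=1)
print(preview["columns"])
```

The detected column list is what the mapping step would offer to the administrator.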

Step 2 — Map fields: The administrator maps columns from the uploaded file to SyRF concepts:
- Which column identifies the study (by SyRF study ID or a custom identifier)
- Which column identifies the annotator (by email or SyRF user ID)
- Which columns map to which annotation questions in the target stage
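One way to picture the result of the mapping step is a small record plus a pre-check that runs before validation starts. The `FieldMapping` class, its field names, and the specific checks are assumptions made for this sketch, not the actual SyRF model.

```python
from dataclasses import dataclass, field


@dataclass
class FieldMapping:
    """Hypothetical shape of a completed mapping."""
    study_column: str
    study_id_kind: str        # "syrf_id" or "custom"
    annotator_column: str
    annotator_id_kind: str    # "email" or "syrf_user_id"
    question_columns: dict = field(default_factory=dict)  # file column -> question ID


def mapping_problems(mapping, detected_columns):
    """List mapping mistakes that can be caught without reading any data rows."""
    problems = []
    used = [mapping.study_column, mapping.annotator_column, *mapping.question_columns]
    for column in used:
        if column not in detected_columns:
            problems.append(f"column not in file: {column}")
    if not mapping.question_columns:
        problems.append("no question columns mapped")
    targets = list(mapping.question_columns.values())
    if len(set(targets)) != len(targets):
        problems.append("two columns map to the same question")
    return problems


mapping = FieldMapping(
    study_column="study_id",
    study_id_kind="syrf_id",
    annotator_column="annotator",
    annotator_id_kind="email",
    question_columns={"species": "question-1", "strain": "question-2"},
)
print(mapping_problems(mapping, ["study_id", "annotator", "species", "strain"]))
```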

Step 3 — Validate: The system automatically validates the entire file in the background:
- Resolves study identifiers to actual studies in the project
- Resolves annotator identifiers to actual SyRF users
- Checks for conflicts with existing annotations (e.g., the same annotator already has answers for a study)
- Detects orphaned child annotations (answers to conditional questions where the parent question has no answer)
- Reports unmatched studies and unresolved annotators
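The checks above can be sketched as a single dry-run pass over the mapped rows. Everything here is illustrative: the function name, the row keys, and the in-memory lookup tables are assumptions, and the real validation runs server-side against the project's data.

```python
def validate_rows(rows, studies, users, existing, parent_of):
    """Dry-run validation; nothing is written.

    rows      : dicts with "study", "annotator", and "question" keys
    studies   : identifier from the file -> study ID
    users     : annotator email -> user ID
    existing  : set of (study ID, user ID) pairs that already have answers
    parent_of : child question ID -> parent question ID
    """
    report = {
        "unmatched_studies": set(),
        "unresolved_annotators": set(),
        "conflicts": set(),
        "orphans": [],
        "clean": [],
    }
    # Every (study, annotator, question) answered in the file, for the orphan check.
    answered = {(r["study"], r["annotator"], r["question"]) for r in rows}

    for row in rows:
        study_id = studies.get(row["study"])
        user_id = users.get(row["annotator"])
        if study_id is None:
            report["unmatched_studies"].add(row["study"])
            continue
        if user_id is None:
            report["unresolved_annotators"].add(row["annotator"])
            continue

        parent = parent_of.get(row["question"])
        is_orphan = (
            parent is not None
            and (row["study"], row["annotator"], parent) not in answered
        )
        is_conflict = (study_id, user_id) in existing
        if is_orphan:
            report["orphans"].append(row)
        if is_conflict:
            report["conflicts"].add((study_id, user_id))
        if not is_orphan and not is_conflict:
            report["clean"].append(row)
    return report


rows = [
    {"study": "ext-1", "annotator": "a@example.org", "question": "q-species"},
    {"study": "ext-2", "annotator": "a@example.org", "question": "q-species"},
    {"study": "ext-1", "annotator": "a@example.org", "question": "q-strain"},
    {"study": "ext-3", "annotator": "a@example.org", "question": "q-strain"},
    {"study": "ext-9", "annotator": "a@example.org", "question": "q-species"},
    {"study": "ext-1", "annotator": "z@example.org", "question": "q-species"},
]
report = validate_rows(
    rows,
    studies={"ext-1": "study-1", "ext-2": "study-2", "ext-3": "study-3"},
    users={"a@example.org": "user-a"},
    existing={("study-2", "user-a")},
    parent_of={"q-strain": "q-species"},
)
print(len(report["clean"]), len(report["orphans"]), len(report["conflicts"]))
```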

Step 4 — Resolve conflicts: The administrator reviews the validation results and makes decisions:
- For studies where imported annotations conflict with existing ones: overwrite the existing data, or skip the import for that study?
- For orphaned child annotations: promote them to top-level answers, or skip them?
- The administrator sees clear counts: X studies clean, Y studies with conflicts, Z unmatched
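A minimal sketch of how the administrator's choices might be turned into an import plan. The status labels, the `plan_import` name, and the rule that a missing decision raises an error are assumptions; the last one simply mirrors the principle that nothing is overwritten without an explicit choice.

```python
from collections import Counter


def plan_import(study_status, decisions):
    """Combine validation status with explicit decisions.

    study_status : study ID -> "clean" | "conflict" | "unmatched"
    decisions    : study ID -> "overwrite" | "skip", required for every conflict
    """
    counts = Counter(study_status.values())
    to_import, to_skip = [], []
    for study_id, status in study_status.items():
        if status == "clean":
            to_import.append(study_id)
        elif status == "conflict":
            choice = decisions.get(study_id)
            if choice not in ("overwrite", "skip"):
                # No silent default: every conflict needs an explicit decision.
                raise ValueError(f"no valid decision for conflicting study {study_id}")
            (to_import if choice == "overwrite" else to_skip).append(study_id)
        else:
            to_skip.append(study_id)
    return counts, to_import, to_skip


study_status = {"s1": "clean", "s2": "conflict", "s3": "conflict", "s4": "unmatched"}
counts, to_import, to_skip = plan_import(study_status, {"s2": "overwrite", "s3": "skip"})
print(f"{counts['clean']} clean, {counts['conflict']} with conflicts, {counts['unmatched']} unmatched")
```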

Step 5 — Commit: The system imports the annotations, creating proper Annotation and AnnotationSession records linked to the correct studies, annotators, and questions. Each imported answer is pinned to a specific AQVersion via the AnnotationQuestionVersionReference(QuestionId, VersionId) composite record, giving imported annotations the same audit trail as manually-entered ones. A summary shows the final statistics: how many annotations were imported, how many studies were affected, and how many were skipped.
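The version pinning described in the commit step could look roughly like the sketch below. Only the `AnnotationQuestionVersionReference(QuestionId, VersionId)` pair comes from this document; the Python classes, the `commit` function, the row keys, and the rule that rows without a published version are skipped are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationQuestionVersionReference:
    """Python stand-in for the (QuestionId, VersionId) composite record."""
    question_id: str
    version_id: str


@dataclass(frozen=True)
class ImportedAnnotation:
    """Hypothetical, simplified annotation record."""
    study_id: str
    annotator_id: str
    answer: object
    question_version: AnnotationQuestionVersionReference


def commit(rows, published_versions):
    """Create annotations pinned to the published version of each question.

    rows               : validated dicts with study_id, annotator_id, question_id, answer
    published_versions : question ID -> currently published version ID
    """
    created, skipped = [], 0
    for row in rows:
        version_id = published_versions.get(row["question_id"])
        if version_id is None:
            skipped += 1  # assumption: no published version means the row is skipped
            continue
        reference = AnnotationQuestionVersionReference(row["question_id"], version_id)
        created.append(
            ImportedAnnotation(row["study_id"], row["annotator_id"], row["answer"], reference)
        )
    summary = {
        "imported": len(created),
        "studies_affected": len({a.study_id for a in created}),
        "skipped": skipped,
    }
    return created, summary


rows = [
    {"study_id": "study-1", "annotator_id": "user-a", "question_id": "q1", "answer": True},
    {"study_id": "study-1", "annotator_id": "user-a", "question_id": "q2", "answer": "mouse"},
    {"study_id": "study-2", "annotator_id": "user-a", "question_id": "q-draft", "answer": 3},
]
created, summary = commit(rows, {"q1": "v3", "q2": "v1"})
print(summary)
```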

For Reviewers/Annotators

After the import completes, annotators will see the imported answers appear in their annotation forms as if they had entered them directly. The import is transparent to the annotation workflow.

Scope

In Scope

Import annotations from structured files in JSON, YAML, or CSV formats
Question ID mapping: map source question identifiers to target AnnotationQuestionV2 IDs
Version awareness: each imported annotation references the current published AQVersion via an AnnotationQuestionVersionReference composite value object
Stage targeting: import into a specific stage's currently published Stage Question Set version
Annotator attribution: attribute each imported annotation to the investigator (SyRF user) resolved from the file's annotator identifier
Validation: async server-side dry-run producing errors, warnings, and a summary
Parent-child annotation relationships: preserve hierarchical annotation structure (subject to format capability — see Key Decisions)
Support for all annotation answer types: Bool, Decimal, String, Int, IntArray, BoolArray, DecimalArray, StringArray
Conflict resolution UI: administrator chooses per-study whether to overwrite, skip, or merge when imported annotations clash with existing ones
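For flat files, each cell arrives as text and has to be converted to one of the eight answer types listed above. The sketch below shows one possible conversion. The accepted boolean spellings and the `;` array separator are assumptions, not part of the documented schema.

```python
from decimal import Decimal


def _parse_bool(text):
    lowered = text.strip().lower()
    if lowered in ("true", "yes", "1"):
        return True
    if lowered in ("false", "no", "0"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Parsers for the four scalar answer types; array types reuse them per item.
SCALAR_PARSERS = {
    "Bool": _parse_bool,
    "Decimal": lambda text: Decimal(text.strip()),
    "String": str,
    "Int": lambda text: int(text.strip()),
}


def parse_answer(answer_type, text, separator=";"):
    """Convert cell text to a typed value; the separator is an assumed convention."""
    if answer_type.endswith("Array"):
        parser = SCALAR_PARSERS[answer_type[: -len("Array")]]
        return [parser(item) for item in text.split(separator) if item.strip()]
    return SCALAR_PARSERS[answer_type](text)


print(parse_answer("IntArray", "1;2;3"))
print(parse_answer("Bool", "Yes"))
```

`Decimal` is used rather than `float` so that a value such as `0.10` keeps its written precision.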

Out of Scope for Initial Release

Real-time streaming import (batch only)
Direct connectors to specific third-party platforms (Covidence, Rayyan, etc.) — users export from those tools and import the portable file here
Automatic question matching by text similarity (manual mapping only)
Import of OutcomeData (quantitative data extraction) — annotations only
Idempotent re-import of the same file (idempotency behaviour is an open question — see below)

Dependencies

Question Management v2 (PRs #2572–#2575): the AnnotationQuestionV2 entity, AQVersion, and ProjectQuestionSet/StageQuestionSet model must have landed. Imports create versioned annotations from day one and need the typed VersionReference value objects (AnnotationQuestionVersionReference, StageQuestionSetVersionReference) to establish audit trails.
Published question set: import targets a specific published StageQuestionSetVersion. Questions must be published (draft state is invalid for import targets) before annotations can be imported against them.
Annotation session model: the import must create AnnotationSession entities with StageQuestionSetVersionReference pinned at import time. Sessions may be one-per-study or one-per-batch (see Open Questions).

Key Decisions

Server-side validation — The entire file is validated in a background process (MassTransit consumer) rather than in the browser. This allows large files (thousands of rows) to be processed without the browser becoming unresponsive. The pattern follows the existing BulkStudyUpdateJob / StudyUpdateRecordProcessor architecture.
Conflict resolution is explicit — When imported data conflicts with existing annotations, the system does not silently overwrite. The administrator must explicitly choose what to do for each conflict group. This prevents accidental data loss.
Google Cloud Storage for temporary files — Uploaded files are stored temporarily in Google Cloud Storage (GCS) rather than the browser. This allows the validation and commit steps to run independently of the browser session, and establishes GCS as the standard for new file-handling features going forward (a deliberate move away from S3 as the platform consolidates on GCP).
Three file formats supported, with a note about nesting — CSV covers the most common export format from other tools. JSON and YAML support structured hierarchical data (parent-child question relationships) that flat CSV cannot represent cleanly. CSV imports are supported for projects where the question tree is flat or where parent-child relationships are expressed through explicit ID columns rather than nesting; for deeply nested question trees, JSON or YAML are the recommended formats. See design.md for the schema specification and format-specific parsers.
Versioned from day one — Every imported annotation is pinned to the AQVersion that was current when the import committed, using the same typed AnnotationQuestionVersionReference records that manually-entered annotations use. Imported annotations participate in the same reconciliation and versioning workflows as any other annotation.
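To illustrate the format decision, the sketch below rebuilds a parent-child tree from a flat CSV that expresses relationships through explicit ID columns. The column names `row_id` and `parent_row_id` are hypothetical; the actual schema is specified in design.md.

```python
import csv
import io


def nest_rows(rows):
    """Rebuild the answer tree from flat rows.

    Returns (roots, orphans): top-level nodes, plus the IDs of rows whose
    parent ID does not appear anywhere in the file.
    """
    nodes = {
        row["row_id"]: {"question": row["question"], "answer": row["answer"], "children": []}
        for row in rows
    }
    roots, orphans = [], []
    for row in rows:
        parent_id = row["parent_row_id"]
        if not parent_id:
            roots.append(nodes[row["row_id"]])
        elif parent_id in nodes:
            nodes[parent_id]["children"].append(nodes[row["row_id"]])
        else:
            orphans.append(row["row_id"])
    return roots, orphans


csv_text = (
    "row_id,parent_row_id,question,answer\n"
    "1,,species,mouse\n"
    "2,1,strain,C57BL/6\n"
    "3,7,dose,10\n"
)
roots, orphans = nest_rows(list(csv.DictReader(io.StringIO(csv_text))))
print(roots[0]["children"][0]["question"], orphans)
```

In JSON or YAML the same structure can be written directly as nested objects, which is why those formats suit deeply nested question trees better.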

Open Questions

Session creation strategy: Should import create one AnnotationSession per study, or one session for the entire batch? The current per-reviewer model is one session per study per stage per annotator.
Reconciliation status: Should imported annotations be marked as reconciled (gold-standard) or candidate answers? This affects whether they appear in the reconciliation queue for other reviewers.
Idempotency: Should re-importing the same file be safe (upsert by content hash) or rejected (duplicate check)?
Progress tracking: Should small imports run as a synchronous API call while larger ones run as an async job (like BulkStudyUpdateJob), or should every import run as an async job?
Maximum batch size: What is a reasonable limit per import file: thousands of annotations, or tens of thousands? The answer determines whether validation and commit stream the file or buffer it.
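If the idempotency question is settled in favour of a content hash, the digest might be computed over the parsed rows rather than the raw bytes, so that row order and key order do not matter. This is only a sketch of that one option, with a hypothetical function name; the question itself remains open.

```python
import hashlib
import json


def content_hash(rows):
    """Order-insensitive SHA-256 digest of parsed rows."""
    # Canonical form: one compact JSON line per row with sorted keys, lines sorted.
    canonical = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


first = [{"study": "S1", "answer": "mouse"}, {"study": "S2", "answer": "rat"}]
reordered = [{"answer": "rat", "study": "S2"}, {"answer": "mouse", "study": "S1"}]
changed = [{"study": "S1", "answer": "mouse"}, {"study": "S2", "answer": "rabbit"}]
print(content_hash(first) == content_hash(reordered))
```

A re-upload with the same digest could then be either rejected as a duplicate or treated as a no-op upsert, depending on how the question is resolved.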

Reference: Existing Import Patterns in SyRF

The screening import (BulkStudyUpdateJob / StudyUpdateRecordProcessor) provides a direct reference pattern:

Job-based async processing via MassTransit
Per-record validation and error collection
Progress tracking on the job entity
Dry-run mode for pre-import validation
File uploaded to cloud storage, processed from there

Annotation import follows the same architectural pattern but with annotation-specific validation and entity creation (and GCS instead of S3 for the file upload — see Key Decisions).
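The shared pattern (per-record validation, error collection, progress on the job entity, dry-run mode) can be sketched in a few lines. This is not the actual BulkStudyUpdateJob code, which is a MassTransit-based .NET implementation; the `ImportJob` class and `run_job` function are hypothetical stand-ins that show the shape of the pattern only.

```python
from dataclasses import dataclass, field


@dataclass
class ImportJob:
    """Hypothetical job entity carrying progress and collected errors."""
    total: int
    processed: int = 0
    errors: list = field(default_factory=list)    # (record index, message)
    written: list = field(default_factory=list)   # indexes of committed records

    @property
    def progress(self):
        return self.processed / self.total if self.total else 1.0


def run_job(records, validate, write, dry_run=True):
    """Validate every record; write valid ones only when dry_run is False."""
    job = ImportJob(total=len(records))
    for index, record in enumerate(records):
        error = validate(record)
        if error:
            job.errors.append((index, error))
        elif not dry_run:
            write(record)
            job.written.append(index)
        job.processed += 1
    return job


def require_study(record):
    """Example per-record check: a record must name a study."""
    return None if record["study"] else "missing study"


records = [
    {"study": "S1", "answer": "yes"},
    {"study": "", "answer": "no"},
    {"study": "S3", "answer": "yes"},
]
store = []
dry = run_job(records, require_study, store.append, dry_run=True)
real = run_job(records, require_study, store.append, dry_run=False)
print(dry.errors, real.written, real.progress)
```

Running the same loop in dry-run mode first gives the validation report without touching stored data.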

How It Connects

Phase 06.1 depends on Phase 6 (Question Management UI) because the import targets a stage with a configured, published question set. It is an urgent insertion into the Release 1 timeline because it addresses a key adoption barrier. The import feature uses the same annotation data model that Phases 4–5 establish (Annotation with embedded AnnotationVersion list, AnnotationSession with per-session AnnotationVersionReference map).