
Stage Filtering

Overview

Feature Name: Stage Filtering
Target Users: Project Administrators, Reviewers
Business Value: Configurable, stage-scoped study pools that enable multi-stage screening pipelines within a single project
Phase: Phase 1 (with Screening Profiles)


Problem & Solution

SyRF currently applies a single, global screening filter: when a study is excluded, it disappears from the entire project. This works for single-stage reviews but breaks down for multi-stage pipelines. There is no way to configure what a specific stage operates on — every stage sees the same globally-filtered set of studies.

With Screening Profiles introducing multiple sets of criteria per project, a new question arises: how does a stage know which studies are in its scope? If a project has both "TA Criteria" and "FT Criteria", the Full-Text stage should only see studies that were included under TA — not the full project corpus, and not studies excluded under FT by other reviewers.

Stage Filtering answers this with two concepts:

  • Stage Study Pool — The reviewer-agnostic set of studies eligible for work in a given stage, defined by the stage's Filter Set: a collection of configurable rules referencing Screening Profile outcomes. This replaces the implicit global filter with explicit, per-stage configuration.

  • Selection Subset — The reviewer-aware subset derived at selection time by applying stage policies (HideExcludedStudiesFromReviewers, MaxInProgress, per-reviewer suppression). These are personalised visibility rules layered on top of the pool — they don't change pool membership.


Filter Set Model

A Filter Set is a JSON structure stored on each stage that defines its Study Pool. The schema supports nested groups and multiple conditions from day one, even though the MVP UI will only expose simple cases.

Schema (v2)

{
  "version": 2,
  "logic": "AND",
  "rules": [
    {
      "type": "profileOutcome",
      "profileId": "<guid>",
      "op": "in",
      "values": ["Included", "Conflict"]
    }
  ]
}

Rule Types

| Type | MVP | Description |
|---|---|---|
| profileOutcome | Yes | Filter by Screening Outcome for a specific profile. Ops: in, notIn |
| annotation | Phase 3 | Filter by reconciled annotation values. Ops: in, notIn, any, all |

Groups combine rules with AND/OR logic and can nest to arbitrary depth. The backend validates and stores the full schema; the MVP UI exposes a single pass-forward rule: "Include studies where Outcome(Profile X) is in {Included, Conflict}".
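
To illustrate how nested groups are meant to evaluate, here is a minimal Python sketch that applies a Filter Set to one in-memory study document. The function name `matches`, the profile IDs `ta` and `ft`, the `Excluded` status value, and the study shape (a `screeningOutcomes` list of `{profileId, status}` entries) are assumptions for illustration. It also assumes a nested group is a rule object carrying its own `logic` and `rules` keys, which the schema above does not spell out.

```python
from typing import Any


def matches(node: dict[str, Any], study: dict[str, Any]) -> bool:
    """Return True if the study satisfies a Filter Set group or rule."""
    if "rules" in node:  # group node: combine child results with AND/OR
        results = [matches(child, study) for child in node["rules"]]
        return all(results) if node["logic"] == "AND" else any(results)
    if node["type"] != "profileOutcome" or node["op"] not in ("in", "notIn"):
        raise ValueError(f"unsupported rule: {node}")
    want_member = node["op"] == "in"
    # Mirror $elemMatch: a single outcome entry must satisfy both conditions.
    return any(
        outcome["profileId"] == node["profileId"]
        and (outcome["status"] in node["values"]) == want_member
        for outcome in study.get("screeningOutcomes", [])
    )


filter_set = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "ta", "op": "in",
         "values": ["Included", "Conflict"]},
        {"logic": "OR", "rules": [  # nested group (assumed shape)
            {"type": "profileOutcome", "profileId": "ft", "op": "notIn",
             "values": ["Excluded"]},
        ]},
    ],
}
study = {"screeningOutcomes": [
    {"profileId": "ta", "status": "Included"},
    {"profileId": "ft", "status": "Conflict"},
]}
print(matches(filter_set, study))  # True
```

An in-memory evaluator like this could serve as a reference oracle in automated tests that compare pool counts against the compiled database query.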

Compilation to MongoDB

The FilterCompiler translates Filter Set JSON into efficient MongoDB queries. Profile outcome rules compile to $elemMatch on the screeningOutcomes array, ensuring conditions match on the same array element (same profile):

// profileOutcome rule → $elemMatch
f.ElemMatch("screeningOutcomes", f.And(
    f.Eq("profileId", rule.ProfileId),
    f.In("status", rule.Values)
))
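
For readers who think in raw query documents, the same compilation can be sketched in Python as a function that returns a MongoDB filter as a plain dict. This is an illustrative sketch, not the FilterCompiler itself: `compile_node` and the profile ID are hypothetical names, and nested groups are assumed to carry their own `logic` and `rules` keys.

```python
from typing import Any

# Map Filter Set operators to MongoDB query operators.
MONGO_OPS = {"in": "$in", "notIn": "$nin"}


def compile_node(node: dict[str, Any]) -> dict[str, Any]:
    """Translate a Filter Set group or rule into a MongoDB query document."""
    if "rules" in node:  # group node
        # Assumes validation has already rejected empty rule sets,
        # because MongoDB does not accept an empty $and/$or array.
        operator = "$and" if node["logic"] == "AND" else "$or"
        return {operator: [compile_node(child) for child in node["rules"]]}
    if node["type"] != "profileOutcome":
        raise ValueError(f"unsupported rule type: {node['type']}")
    # Both conditions sit inside one $elemMatch so they must hold
    # for the same screeningOutcomes entry.
    return {
        "screeningOutcomes": {
            "$elemMatch": {
                "profileId": node["profileId"],
                "status": {MONGO_OPS[node["op"]]: node["values"]},
            }
        }
    }


filter_set = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "ta-profile", "op": "in",
         "values": ["Included", "Conflict"]},
    ],
}
query = compile_node(filter_set)
print(query)
```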

A simplifier runs before compilation and is a correctness requirement, not just an optimization. When two profileOutcome rules target the same profileId but are compiled as separate $elemMatch blocks, MongoDB can match them against different array elements — returning studies that satisfy rule A on one outcome entry and rule B on another, rather than requiring both conditions on the same entry. The simplifier prevents this by merging rules per profileId before compilation:

  • AND(in(A), in(B)) for same profileId → in(A ∩ B) (single $elemMatch)
  • AND(in(A), notIn(B)) for same profileId → in(A - B) (single $elemMatch)
  • Flatten nested groups with the same logic operator
  • Remove tautologies and contradictions
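
The first two merge rules can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the actual simplifier: `merge_and_rules` is a hypothetical name, it handles only the profileOutcome rules of a single AND group, and it represents a contradiction as an `in` rule with an empty value list (which matches nothing) rather than deciding how the caller should treat it.

```python
from typing import Any, Optional


def merge_and_rules(rules: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge profileOutcome rules of one AND group so each profileId appears once."""
    allowed: dict[str, Optional[set[str]]] = {}  # None means "no in-rule seen yet"
    denied: dict[str, set[str]] = {}
    for rule in rules:
        pid, values = rule["profileId"], set(rule["values"])
        allowed.setdefault(pid, None)
        denied.setdefault(pid, set())
        if rule["op"] == "in":
            # AND(in(A), in(B)) -> in(A intersect B)
            current = allowed[pid]
            allowed[pid] = values if current is None else current & values
        elif rule["op"] == "notIn":
            denied[pid] |= values
        else:
            raise ValueError(f"unsupported operator: {rule['op']}")

    merged = []
    for pid, allow in allowed.items():
        if allow is None:
            # Only notIn rules were present: keep a single merged notIn.
            op, values = "notIn", sorted(denied[pid])
        else:
            # AND(in(A), notIn(B)) -> in(A minus B)
            op, values = "in", sorted(allow - denied[pid])
        merged.append(
            {"type": "profileOutcome", "profileId": pid, "op": op, "values": values}
        )
    return merged


rules = [
    {"type": "profileOutcome", "profileId": "ta", "op": "in",
     "values": ["Included", "Conflict"]},
    {"type": "profileOutcome", "profileId": "ta", "op": "notIn",
     "values": ["Conflict"]},
]
print(merge_and_rules(rules))
```

After merging, each profileId yields exactly one rule, so compilation emits a single $elemMatch per profile.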

Validation

The backend returns 422 for invalid profile IDs, empty rule sets, or unsupported operators. Circular references (stage A filters on stage B's profile, which filters on stage A's profile) are detected at save time.
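
As a rough illustration of the structural checks, the following Python sketch collects validation errors for a Filter Set; a caller would respond with 422 when the list is non-empty. The names `validate_filter_set` and `SUPPORTED_OPS`, the error message wording, and the nested-group shape are assumptions. Circular-reference detection is not covered, since it needs the other stages' configuration.

```python
from typing import Any

# Operators accepted per rule type at MVP (annotation rules arrive in Phase 3).
SUPPORTED_OPS = {"profileOutcome": {"in", "notIn"}}


def validate_filter_set(
    node: dict[str, Any], known_profile_ids: set[str], path: str = "root"
) -> list[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors: list[str] = []
    if "rules" in node:  # group node
        if node.get("logic") not in ("AND", "OR"):
            errors.append(f"{path}: logic must be AND or OR")
        if not node["rules"]:
            errors.append(f"{path}: empty rule set")
        for i, child in enumerate(node["rules"]):
            child_path = f"{path}.rules[{i}]"
            errors.extend(validate_filter_set(child, known_profile_ids, child_path))
        return errors

    ops = SUPPORTED_OPS.get(node.get("type"))
    if ops is None:
        errors.append(f"{path}: unsupported rule type {node.get('type')!r}")
        return errors
    if node.get("op") not in ops:
        errors.append(f"{path}: unsupported operator {node.get('op')!r}")
    if node.get("profileId") not in known_profile_ids:
        errors.append(f"{path}: unknown profileId {node.get('profileId')!r}")
    return errors


bad = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "missing", "op": "any",
         "values": ["Included"]},
    ],
}
for message in validate_filter_set(bad, known_profile_ids={"ta"}):
    print(message)
```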


Stage Settings & Selection

Stage Settings control the reviewer experience within a stage. They are configured per-stage by a Project Admin.

Required Settings

  • StudySelectionMode — Required at stage creation (no global default). Values: screening | annotation | screeningAndAnnotation | reconciliation
  • Screening Profile — Which profile governs screening decisions in this stage

Policy Settings

The following are all selection-time rules and are never part of the Stage Study Pool:

  • HideExcludedStudiesFromReviewers (default: ON) — Suppress already-excluded studies from screening selection. Does NOT apply when StudySelectionMode = reconciliation, because screening annotation reconcilers need to see excluded studies to reconcile their annotations.
  • MaxInProgress — Cap on concurrent in-progress studies per reviewer
  • SessionCountTarget — How many candidate annotation sessions are needed before a study is considered fully reviewed for data extraction. This applies to annotation and reconciliation modes only — screening decision counts are governed by the Screening Profile's agreement mode, not by this setting
  • SelfReconciliation (default: OFF) — Whether a reviewer who screened the study can also be its screening annotation reconciler

Selection Behaviour by Mode

| Mode | "Get next" returns | Pool basis |
|---|---|---|
| screening | Next unscreened study from Stage Study Pool, filtered by policies | Stage Study Pool minus per-reviewer suppression |
| annotation | Next study needing data extraction annotations | Included studies (Final Screening Outcome = Included) |
| screeningAndAnnotation | Next study for combined screening + annotation | Stage Study Pool, reviewer screens then annotates in one session |
| reconciliation | Next study eligible for screening annotation reconciliation | Screening Annotation Reconciliation Pool, respecting SelfReconciliation policy |

Reconciliation: Acts, Workflow, and Stage Settings

The screening pipeline involves two distinct reconciliation acts that produce different entities. Only the first has a scheduled phase (Phase 2, with Screening Annotations); the second is future work. Both are described here for completeness.

| Reconciliation Act | Entity Created | Selection Mode | Implemented |
|---|---|---|---|
| Screening annotation reconciliation | Reconciled Screening (decision + screening annotation) | reconciliation | Phase 2 (with Screening Annotations) |
| Data extraction reconciliation | Reconciled Annotation Session (authoritative data extraction answers) | Future: own mode or integration with above | Future |

How each relates to stage settings:

  • Screening decision reconciliation is automatic (agreement mode) — it doesn't appear as a selection mode because it happens implicitly as screening decisions accumulate. The number of decisions required is governed by the Screening Profile's agreement mode, not by SessionCountTarget.
  • Screening annotation reconciliation is the reconciliation selection mode above — a manual process where a reconciler reviews candidate screenings and creates a Reconciled Screening. SelfReconciliation controls whether a reviewer who screened the study can also be its reconciler. HideExcludedStudiesFromReviewers does NOT apply in this mode, because reconcilers need to see excluded studies.
  • Data extraction reconciliation is a future feature that will need its own selection mode or integration with the reconciliation mode. SessionCountTarget applies to data extraction — it controls how many candidate annotation sessions are needed before a study is eligible for this reconciliation.

Integrated workflow (future): The two reconciliation acts are always conceptually distinct (different entities, different purposes). However, the user-facing workflow could be integrated into a unified reconciler workbench — shared assignment queues, a single study view covering both screening annotations and data extraction — without merging the underlying data models. This is a UX concern to be addressed when data extraction reconciliation is designed.


API & Integration

Filter Set Management (Admin)

Filter Sets are managed as part of stage configuration. No separate endpoint — they're persisted as a property of the stage document.

  • PUT /api/projects/{projectId}/stages/{stageId} — Update stage settings including Filter Set
  • Validation returns 422 if Filter Set references nonexistent profile IDs, invalid operators, or malformed group structure
  • Filter Set changes take effect immediately for subsequent select_next calls (no cache)

Stage Study Pool via Studies Endpoint

  • GET /api/projects/{projectId}/studies?stageId={stageId} — Returns the Stage Study Pool (reviewer-agnostic)
  • When stageId is provided, the stage's Filter Set is compiled to a MongoDB query and applied
  • When stageId is omitted, returns all project studies (existing behaviour preserved)
  • No per-reviewer suppression applied — this is the raw pool for admin review and stats

Selection (Reviewer-Facing)

  • POST /api/projects/{projectId}/stages/{stageId}/studies/{studyId}/review — Submit screening decision (existing pattern, unchanged)
  • Selection logic (get-next) applies the Stage Study Pool first, then layers policies (HideExcluded, MaxInProgress, per-reviewer suppression) at selection time
  • When no eligible study remains, the get-next endpoint returns 204 No Content; the UI should treat this as "no more studies available" rather than as an error
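
The layering described in these bullets can be sketched in Python as below. This is a simplified illustration, not the selection service: `select_next` here is a hypothetical function over an already-filtered in-memory pool, the study fields `excluded` and `screenedBy` are assumed, and the MaxInProgress and SelfReconciliation policies are left out because their exact behaviour is not specified in this section. It also returns the first eligible study rather than a random one.

```python
from typing import Any, Optional


def select_next(
    pool: list[dict[str, Any]],
    reviewer_id: str,
    mode: str,
    hide_excluded: bool = True,
) -> Optional[dict[str, Any]]:
    """Pick the next study for a reviewer from the Stage Study Pool.

    Policies only narrow what this reviewer is offered; the pool itself
    is never modified.
    """
    for study in pool:
        # Per-reviewer suppression: skip studies this reviewer already screened.
        if reviewer_id in study["screenedBy"]:
            continue
        # HideExcludedStudiesFromReviewers does not apply in reconciliation mode.
        if hide_excluded and mode != "reconciliation" and study["excluded"]:
            continue
        return study
    return None  # the API layer would map this to 204 No Content


pool = [
    {"id": "s1", "excluded": False, "screenedBy": ["r1"]},
    {"id": "s2", "excluded": True, "screenedBy": []},
    {"id": "s3", "excluded": False, "screenedBy": []},
]
print(select_next(pool, "r1", mode="screening"))       # study s3
print(select_next(pool, "r1", mode="reconciliation"))  # study s2
```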

Stats

  • GET /api/projects/{projectId}/stages/{stageId}/stats — Returns counts scoped to the Stage Study Pool:
      ◦ AvailableForScreening, AvailableForAnnotation, ReconciliationEligible
      ◦ InProgress(caller), Completed(caller), ReconciliationInProgress(caller)

Important scoping distinction: Screening stats (screened count, pending, conflicts) are inherently profile-scoped — a study's screening outcome belongs to its Screening Profile regardless of which stage uses it. The stage stats endpoint resolves the stage's assigned profile and reports screening stats for that profile, but the underlying data is not stage-owned. Annotation and data extraction stats are genuinely stage-scoped. See Screening Profiles — Stats Are Profile-Scoped.

Legacy Compatibility

  • Stages without a Filter Set apply no pool filtering: their Stage Study Pool is all project studies (current behaviour)
  • No migration needed — existing stages continue working as-is
  • Filter Sets are purely additive; admins opt in when they configure multi-stage pipelines

Performance & Indexing

Required Indexes

The following compound indexes support the Filter Set compilation and selection queries:

| Index | Purpose |
|---|---|
| screeningOutcomes.profileId + screeningOutcomes.status | $elemMatch for profileOutcome rules |
| screeningOutcomes.profileId + screeningOutcomes.decisions.reviewerId | Per-reviewer suppression in selection |
| Study-level rand field (ascending) | Efficient random selection (see below) |

Phase 3 (annotation-based filtering) will require additional indexes on reconciled annotation values — likely flattened fields (reconciledAnnotations.<questionId>) rather than wildcard indexes, to maintain selectivity.

Random Selection Strategy

For "get next" to select a random study from the filtered pool efficiently, avoid $sample after a large $match pipeline (CPU-heavy scan). Instead, store a precomputed rand field (double in [0, 1)) on each Study document:

r = random()
q1: match(pool & rand >= r) sort(rand ASC) limit(1)
if none: q2: match(pool & rand < r) sort(rand ASC) limit(1)

This uses the rand index for an efficient range scan. The wrap-around query (q2) handles the case where no study in the pool has rand >= r, which is most likely when r is close to 1.0.
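
The two-query pattern can be simulated in memory to check the wrap-around logic. The Python sketch below is illustrative only: `pick_random` is a hypothetical helper, the list comprehensions stand in for the indexed range queries q1 and q2, and the injectable `rng` parameter exists purely to make the behaviour testable.

```python
import random
from typing import Any, Callable, Optional


def pick_random(
    pool: list[dict[str, Any]],
    rng: Callable[[], float] = random.random,
) -> Optional[dict[str, Any]]:
    """Pick a study using a precomputed rand field and a random pivot r."""
    r = rng()
    # q1: smallest rand at or above the pivot.
    at_or_above = [s for s in pool if s["rand"] >= r]
    if at_or_above:
        return min(at_or_above, key=lambda s: s["rand"])
    # q2 (wrap-around): nothing at or above r, so take the smallest rand below it.
    below = [s for s in pool if s["rand"] < r]
    return min(below, key=lambda s: s["rand"]) if below else None


pool = [
    {"id": "s1", "rand": 0.10},
    {"id": "s2", "rand": 0.40},
    {"id": "s3", "rand": 0.90},
]
print(pick_random(pool, rng=lambda: 0.50)["id"])  # s3
print(pick_random(pool, rng=lambda: 0.95)["id"])  # s1 (wrap-around)
```

In the database, each list comprehension corresponds to one range query on the rand index with sort(rand ASC) and limit(1), so at most two index lookups are needed per selection.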

Known Performance Risks

| Risk | Trigger | Mitigation |
|---|---|---|
| Multikey fan-out | Many profiles across large corpus → large index entries per study | Pre-filter by profileId in the $elemMatch; keep profile count per project reasonable |
| Deep OR trees | Complex Filter Sets with deeply nested OR groups | Simplifier flattens redundant nesting; push most selective branches first in compiled pipeline |

Observability

Key metrics to emit for selection and filtering:

  • select_next.duration_ms — selection latency (basis for p95 target)
  • select_next.mode — which selection mode was used
  • select_next.filtered_out_reason — why a study was skipped (suppression / hiddenExcluded / maxInProgress)
  • filter_compile.duration_ms — Filter Set compilation time
  • stage_study_pool.count — pool size per stage (for dashboard monitoring)

Collect a 2-week baseline post-launch before setting SLO alert thresholds.


Success Criteria

  • select_next p95 < 400 ms with Filter Set compilation
  • Stage Study Pool counts are consistent with Filter Set rules (verified by automated tests)
  • Legacy stages with no Filter Set behave identically to current behaviour
  • Admin can configure a two-stage pipeline (TA → FT) in ≤ 5 minutes