Stage Filtering

Overview

Feature Name: Stage Filtering
Target Users: Project Administrators, Reviewers
Business Value: Configurable, stage-scoped study pools that enable multi-stage screening pipelines within a single project
Phase: Phase 1 (with Screening Profiles)

Problem & Solution

SyRF currently applies a single, global screening filter: when a study is excluded, it disappears from the entire project. This works for single-stage reviews but breaks down for multi-stage pipelines. There is no way to configure what a specific stage operates on — every stage sees the same globally-filtered set of studies.

With Screening Profiles introducing multiple sets of criteria per project, a new question arises: how does a stage know which studies are in its scope? If a project has both "TA Criteria" (title/abstract) and "FT Criteria" (full-text), the Full-Text stage should see only studies that were included under TA, not the full project corpus and not studies excluded under FT by other reviewers.

Stage Filtering answers this with two concepts:

Stage Study Pool — The reviewer-agnostic set of studies eligible for work in a given stage, defined by the stage's Filter Set: a collection of configurable rules referencing Screening Profile outcomes. This replaces the implicit global filter with explicit, per-stage configuration.
Selection Subset — The reviewer-aware subset derived at selection time by applying stage policies (HideExcludedStudiesFromReviewers, MaxInProgress, per-reviewer suppression). These are personalised visibility rules layered on top of the pool — they don't change pool membership.

Filter Set Model

A Filter Set is a JSON structure stored on each stage that defines its Study Pool. The schema supports nested groups and multiple conditions from day one, even though the MVP UI will only expose simple cases.

Schema (v2)

{
  "version": 2,
  "logic": "AND",
  "rules": [
    {
      "type": "profileOutcome",
      "profileId": "<guid>",
      "op": "in",
      "values": ["Included", "Conflict"]
    }
  ]
}

Rule Types

Type	MVP	Description
`profileOutcome`	Yes	Filter by Screening Outcome for a specific profile. Ops: `in`, `notIn`
`annotation`	Phase 3	Filter by reconciled annotation values. Ops: `in`, `notIn`, `any`, `all`

Groups combine rules with AND/OR logic and can nest to arbitrary depth. The backend validates and stores the full schema; the MVP UI exposes a single pass-forward rule: "Include studies where Outcome(Profile X) is in {Included, Conflict}".
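
To make the group semantics concrete, the sketch below evaluates a Filter Set against a single in-memory study. It is an illustration only: the nested-group shape (`"type": "group"`), the profile IDs `ta` and `ft`, and the study layout (a `screeningOutcomes` list of `profileId`/`status` entries) are assumptions, since the schema above shows only a top-level group with one `profileOutcome` rule.

```python
from typing import Any


def rule_matches(rule: dict[str, Any], study: dict[str, Any]) -> bool:
    """True if one outcome entry for the rule's profile satisfies the operator."""
    if rule["type"] != "profileOutcome" or rule["op"] not in ("in", "notIn"):
        raise ValueError(f"unsupported rule: {rule}")
    for outcome in study.get("screeningOutcomes", []):
        if outcome["profileId"] != rule["profileId"]:
            continue
        listed = outcome["status"] in rule["values"]
        # "in" needs a listed status, "notIn" needs an unlisted one.
        if listed == (rule["op"] == "in"):
            return True
    return False


def matches(node: dict[str, Any], study: dict[str, Any]) -> bool:
    """Evaluate a group (anything with a `rules` list) or a single rule."""
    if "rules" not in node:
        return rule_matches(node, study)
    results = (matches(child, study) for child in node["rules"])
    return all(results) if node["logic"] == "AND" else any(results)


# Hypothetical Filter Set: passed TA, and not excluded at FT.
filter_set = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "ta", "op": "in",
         "values": ["Included", "Conflict"]},
        {"type": "group", "logic": "OR", "rules": [
            {"type": "profileOutcome", "profileId": "ft", "op": "notIn",
             "values": ["Excluded"]},
        ]},
    ],
}
study = {"screeningOutcomes": [
    {"profileId": "ta", "status": "Included"},
    {"profileId": "ft", "status": "Conflict"},
]}
print(matches(filter_set, study))  # True
```

Both operators are evaluated per outcome entry, which mirrors the same-element behaviour of `$elemMatch`.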

Compilation to MongoDB

The FilterCompiler translates Filter Set JSON into efficient MongoDB queries. Profile outcome rules compile to $elemMatch on the screeningOutcomes array, ensuring conditions match on the same array element (same profile):

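// profileOutcome rule → $elemMatch
// Assumes f is Builders<BsonDocument>.Filter, so the same builder serves
// both the study document and the embedded outcome entries.
f.ElemMatch("screeningOutcomes", f.And(
    f.Eq("profileId", rule.ProfileId),  // restrict to this profile's entry
    f.In("status", rule.Values)         // status must be one of the listed values
))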
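
For readers who want to see the resulting query shape, here is a minimal Python sketch that produces the equivalent MongoDB filter document as a plain dict. It is not the FilterCompiler itself: the function name `compile_node`, the mapping of `notIn` to `$nin`, and the handling of nested groups are assumptions for illustration.

```python
from typing import Any

# Assumed mapping from Filter Set operators to MongoDB query operators.
MONGO_OPS = {"in": "$in", "notIn": "$nin"}


def compile_node(node: dict[str, Any]) -> dict[str, Any]:
    """Translate a Filter Set group or rule into a MongoDB filter document."""
    if "rules" in node:
        parts = [compile_node(child) for child in node["rules"]]
        if not parts:
            raise ValueError("empty rule set")
        if len(parts) == 1:
            return parts[0]  # a one-rule group needs no $and/$or wrapper
        operator = "$and" if node["logic"] == "AND" else "$or"
        return {operator: parts}
    if node["type"] != "profileOutcome":
        raise ValueError(f"unsupported rule type: {node['type']}")
    return {
        "screeningOutcomes": {
            "$elemMatch": {
                "profileId": node["profileId"],
                "status": {MONGO_OPS[node["op"]]: list(node["values"])},
            }
        }
    }


schema_example = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "<guid>", "op": "in",
         "values": ["Included", "Conflict"]},
    ],
}
compiled = compile_node(schema_example)
```

Because both conditions sit inside one `$elemMatch`, a study matches only when a single `screeningOutcomes` entry carries the right `profileId` and an accepted `status`.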

A simplifier runs before compilation and is a correctness requirement, not just an optimization. When two profileOutcome rules target the same profileId but are compiled as separate $elemMatch blocks, MongoDB can match them against different array elements — returning studies that satisfy rule A on one outcome entry and rule B on another, rather than requiring both conditions on the same entry. The simplifier prevents this by merging rules per profileId before compilation:

AND(in(A), in(B)) for same profileId → in(A ∩ B) (single $elemMatch)
AND(in(A), notIn(B)) for same profileId → in(A - B) (single $elemMatch)
Flatten nested groups with the same logic operator
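Resolve tautologies and contradictions: drop a tautology from an AND group or a contradiction from an OR group; a contradiction in an AND group, such as in(∅), makes the whole group match nothing, and a tautology in an OR group makes it match everything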
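
The per-profile merge can be sketched as follows for the rules of one AND group. This is a hedged illustration rather than the actual simplifier: `merge_and_rules` is a hypothetical name, and merging two `notIn` rules into a single `notIn` over the union of their values is an extension not listed above.

```python
from typing import Any


def merge_and_rules(rules: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge profileOutcome rules of one AND group so each profileId appears once."""
    allowed: dict[str, set[str]] = {}  # profileId -> intersection of all `in` sets
    denied: dict[str, set[str]] = {}   # profileId -> union of all `notIn` sets
    order: list[str] = []              # keeps first-seen profile order
    for rule in rules:
        pid, values = rule["profileId"], set(rule["values"])
        if pid not in order:
            order.append(pid)
        if rule["op"] == "in":
            allowed[pid] = allowed[pid] & values if pid in allowed else values
        elif rule["op"] == "notIn":
            denied[pid] = denied.get(pid, set()) | values
        else:
            raise ValueError(f"unsupported operator: {rule['op']}")

    merged = []
    for pid in order:
        if pid in allowed:
            # in(A) combined with notIn(B) becomes in(A minus B).
            op, values = "in", allowed[pid] - denied.get(pid, set())
        else:
            op, values = "notIn", denied[pid]
        merged.append({"type": "profileOutcome", "profileId": pid,
                       "op": op, "values": sorted(values)})
    return merged


rules = [
    {"type": "profileOutcome", "profileId": "ta", "op": "in",
     "values": ["Included", "Conflict"]},
    {"type": "profileOutcome", "profileId": "ta", "op": "notIn",
     "values": ["Conflict"]},
]
print(merge_and_rules(rules))
# [{'type': 'profileOutcome', 'profileId': 'ta', 'op': 'in', 'values': ['Included']}]
```

An `in` rule that ends up with no values is kept deliberately: it is a contradiction, and an AND group containing it should match nothing.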

Validation

The backend returns 422 for invalid profile IDs, empty rule sets, or unsupported operators. Circular references (stage A filters on stage B's profile, which filters on stage A's profile) are detected at save time.
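
A minimal sketch of the structural checks, assuming a validator that collects error messages which the API layer would turn into a 422 response. The function name, the error wording, and the `known_profile_ids` parameter are hypothetical, and circular-reference detection is not covered here.

```python
from typing import Any

# Operators accepted per rule type in the MVP (from the Rule Types table).
SUPPORTED_OPS = {"profileOutcome": {"in", "notIn"}}


def validate(node: dict[str, Any], known_profile_ids: set[str],
             path: str = "filterSet") -> list[str]:
    """Return a list of validation errors; an empty list means the node is valid."""
    errors: list[str] = []
    if "rules" in node:  # group: check logic, emptiness, then recurse
        if node.get("logic") not in ("AND", "OR"):
            errors.append(f"{path}: logic must be AND or OR")
        if not node["rules"]:
            errors.append(f"{path}: empty rule set")
        for i, child in enumerate(node["rules"]):
            errors += validate(child, known_profile_ids, f"{path}.rules[{i}]")
        return errors

    ops = SUPPORTED_OPS.get(node.get("type"))
    if ops is None:
        return [f"{path}: unsupported rule type {node.get('type')!r}"]
    if node.get("op") not in ops:
        errors.append(f"{path}: unsupported operator {node.get('op')!r}")
    if node.get("profileId") not in known_profile_ids:
        errors.append(f"{path}: unknown profile {node.get('profileId')!r}")
    return errors


bad = {"version": 2, "logic": "AND", "rules": [
    {"type": "profileOutcome", "profileId": "missing", "op": "any",
     "values": ["Included"]},
]}
problems = validate(bad, known_profile_ids={"ta", "ft"})
```

Collecting every error before responding lets the admin UI show all problems in one round trip instead of one per save attempt.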

Stage Settings & Selection

Stage Settings control the reviewer experience within a stage. They are configured per-stage by a Project Admin.

Required Settings

StudySelectionMode — Required at stage creation (no global default). Values: screening | annotation | screeningAndAnnotation | reconciliation
Screening Profile — Which profile governs screening decisions in this stage

Policy Settings

All of the following are selection-time rules and never affect Stage Study Pool membership:

HideExcludedStudiesFromReviewers (default: ON) — Suppress already-excluded studies from screening selection. Does NOT apply when StudySelectionMode = reconciliation, because screening annotation reconcilers need to see excluded studies to reconcile their annotations.
MaxInProgress — Cap on concurrent in-progress studies per reviewer
SessionCountTarget — How many candidate annotation sessions are needed before a study is considered fully reviewed for data extraction. This applies to annotation and reconciliation modes only — screening decision counts are governed by the Screening Profile's agreement mode, not by this setting
SelfReconciliation (default: OFF) — Whether a reviewer who screened the study can also be its screening annotation reconciler

Selection Behaviour by Mode

Mode	"Get next" returns	Pool basis
`screening`	Next unscreened study from Stage Study Pool, filtered by policies	Stage Study Pool minus per-reviewer suppression
`annotation`	Next study needing data extraction annotations	Included studies (Final Screening Outcome = Included)
`screeningAndAnnotation`	Next study for combined screening + annotation	Stage Study Pool, reviewer screens then annotates in one session
`reconciliation`	Next study eligible for screening annotation reconciliation	Screening Annotation Reconciliation Pool, respecting SelfReconciliation policy

Reconciliation: Acts, Workflow, and Stage Settings

The screening pipeline involves two distinct reconciliation acts that produce different entities. Only the first is scheduled (Phase 2, with Screening Annotations); the second is future work. Both are described here for completeness.

Reconciliation Act	Entity Created	Selection Mode	Implemented
Screening annotation reconciliation	Reconciled Screening (decision + screening annotation)	`reconciliation`	Phase 2 (with Screening Annotations)
Data extraction reconciliation	Reconciled Annotation Session (authoritative data extraction answers)	Future — own mode or integration with above	Future

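How each form of reconciliation relates to stage settings (the first item, automatic screening decision reconciliation, is not one of the two acts above):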

Screening decision reconciliation is automatic (agreement mode) — it doesn't appear as a selection mode because it happens implicitly as screening decisions accumulate. The number of decisions required is governed by the Screening Profile's agreement mode, not by SessionCountTarget.
Screening annotation reconciliation is the reconciliation selection mode above — a manual process where a reconciler reviews candidate screenings and creates a Reconciled Screening. SelfReconciliation controls whether a reviewer who screened the study can also be its reconciler. HideExcludedStudiesFromReviewers does NOT apply in this mode, because reconcilers need to see excluded studies.
Data extraction reconciliation is a future feature that will need its own selection mode or integration with the reconciliation mode. SessionCountTarget applies to data extraction — it controls how many candidate annotation sessions are needed before a study is eligible for this reconciliation.
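
The two policy checks described above are small enough to state as predicates. This is a sketch under assumptions: the function names and parameters are hypothetical, and "eligible" is read as "at least SessionCountTarget candidate sessions exist".

```python
def may_reconcile(reviewer_id: str, screened_by: set[str],
                  self_reconciliation: bool = False) -> bool:
    """SelfReconciliation: with the policy OFF, screeners cannot reconcile their own study."""
    return self_reconciliation or reviewer_id not in screened_by


def extraction_reconciliation_ready(session_count: int,
                                    session_count_target: int) -> bool:
    """SessionCountTarget: enough candidate annotation sessions have accumulated."""
    return session_count >= session_count_target


# Reviewer r1 screened the study, so r1 is blocked unless the policy is ON.
print(may_reconcile("r1", {"r1", "r2"}))                            # False
print(may_reconcile("r1", {"r1", "r2"}, self_reconciliation=True))  # True
```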

Integrated workflow (future): The two reconciliation acts are always conceptually distinct (different entities, different purposes). However, the user-facing workflow could be integrated into a unified reconciler workbench — shared assignment queues, a single study view covering both screening annotations and data extraction — without merging the underlying data models. This is a UX concern to be addressed when data extraction reconciliation is designed.

API & Integration

Filter Set Management (Admin)

Filter Sets are managed as part of stage configuration. No separate endpoint — they're persisted as a property of the stage document.

PUT /api/projects/{projectId}/stages/{stageId} — Update stage settings including Filter Set
Validation returns 422 if Filter Set references nonexistent profile IDs, invalid operators, or malformed group structure
Filter Set changes take effect immediately for subsequent select_next calls (no cache)

Stage Study Pool via Studies Endpoint

GET /api/projects/{projectId}/studies?stageId={stageId} — Returns the Stage Study Pool (reviewer-agnostic)
When stageId is provided, the stage's Filter Set is compiled to a MongoDB query and applied
When stageId is omitted, returns all project studies (existing behaviour preserved)
No per-reviewer suppression applied — this is the raw pool for admin review and stats

Selection (Reviewer-Facing)

POST /api/projects/{projectId}/stages/{stageId}/studies/{studyId}/review — Submit screening decision (existing pattern, unchanged)
Selection logic (get-next) applies the Stage Study Pool first, then layers policies (HideExcludedStudiesFromReviewers, MaxInProgress, per-reviewer suppression) at selection time
When no eligible study remains, get-next returns 204 No Content; the UI should treat this as "no more studies available" rather than an error
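
The layering order can be sketched over an in-memory pool as shown below. Several details are assumptions made for illustration: the study fields `screenedBy` and `excluded`, the `StagePolicies` class, the choice to hand out nothing once MaxInProgress is reached, and applying per-reviewer suppression in every mode.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Optional


@dataclass
class StagePolicies:
    mode: str = "screening"
    hide_excluded: bool = True             # HideExcludedStudiesFromReviewers
    max_in_progress: Optional[int] = None  # MaxInProgress; None means no cap


def select_next(pool: Iterable[dict[str, Any]], reviewer_id: str,
                in_progress: int, policies: StagePolicies) -> Optional[dict[str, Any]]:
    """Pick the first pool study that survives the selection-time policies."""
    if policies.max_in_progress is not None and in_progress >= policies.max_in_progress:
        return None  # assumption: a reviewer at the cap gets no new study
    # HideExcludedStudiesFromReviewers does not apply in reconciliation mode.
    hide = policies.hide_excluded and policies.mode != "reconciliation"
    for study in pool:
        if reviewer_id in study.get("screenedBy", ()):  # per-reviewer suppression
            continue
        if hide and study.get("excluded", False):
            continue
        return study
    return None


def status_for(study: Optional[dict[str, Any]]) -> int:
    """Map the selection result to the HTTP status described above."""
    return 200 if study is not None else 204


pool = [
    {"id": "s1", "screenedBy": ["r1"]},
    {"id": "s2", "excluded": True},
    {"id": "s3"},
]
chosen = select_next(pool, "r1", in_progress=0, policies=StagePolicies())
print(chosen["id"], status_for(chosen))  # s3 200
```

The pool itself is never modified: the policies only decide what this reviewer is offered, which keeps pool counts reviewer-agnostic.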

Stats

GET /api/projects/{projectId}/stages/{stageId}/stats — Returns counts scoped to the Stage Study Pool:
AvailableForScreening, AvailableForAnnotation, ReconciliationEligible
InProgress(caller), Completed(caller), ReconciliationInProgress(caller)

Important scoping distinction: Screening stats (screened count, pending, conflicts) are inherently profile-scoped — a study's screening outcome belongs to its Screening Profile regardless of which stage uses it. The stage stats endpoint resolves the stage's assigned profile and reports screening stats for that profile, but the underlying data is not stage-owned. Annotation and data extraction stats are genuinely stage-scoped. See Screening Profiles — Stats Are Profile-Scoped.

Legacy Compatibility

Stages without a Filter Set are treated as unfiltered: the Stage Study Pool is all project studies (current behaviour)
No migration needed — existing stages continue working as-is
Filter Sets are purely additive; admins opt in when they configure multi-stage pipelines
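
The legacy rule amounts to a simple fallback when building the pool query. The sketch below assumes studies carry a `projectId` field and that the Filter Set has already been compiled to a MongoDB filter document; both the field name and the function name are hypothetical.

```python
from typing import Any, Optional


def stage_pool_query(project_id: str,
                     compiled_filter: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Build the Stage Study Pool query; no Filter Set means the whole project."""
    base = {"projectId": project_id}
    if not compiled_filter:
        return base  # legacy stage, or stageId omitted: all project studies
    return {"$and": [base, compiled_filter]}


legacy = stage_pool_query("p1", None)
filtered = stage_pool_query("p1", {
    "screeningOutcomes": {"$elemMatch": {
        "profileId": "ta",
        "status": {"$in": ["Included", "Conflict"]},
    }}
})
```

Wrapping both parts in `$and` avoids key collisions if a compiled filter ever needs to constrain a field that the base query also uses.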

Performance & Indexing

Required Indexes

The following indexes support the Filter Set compilation and selection queries:

Index	Purpose
`screeningOutcomes.profileId` + `screeningOutcomes.status`	`$elemMatch` for `profileOutcome` rules
`screeningOutcomes.profileId` + `screeningOutcomes.decisions.reviewerId`	Per-reviewer suppression in selection
Study-level `rand` field (ascending)	Efficient random selection (see below)
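
As a sketch, the three indexes can be declared as key lists and applied to a pymongo-style collection. `Collection.create_index` in pymongo accepts a list of (field, direction) pairs and returns the index name; the collection passed in and the names `STAGE_FILTER_INDEXES` and `ensure_indexes` are assumptions.

```python
from typing import Any

ASC = 1  # same value as pymongo.ASCENDING

# One entry per row of the index table above.
STAGE_FILTER_INDEXES = [
    [("screeningOutcomes.profileId", ASC), ("screeningOutcomes.status", ASC)],
    [("screeningOutcomes.profileId", ASC),
     ("screeningOutcomes.decisions.reviewerId", ASC)],
    [("rand", ASC)],
]


def ensure_indexes(studies: Any) -> list[str]:
    """Create each index on a pymongo-style collection and return the index names."""
    return [studies.create_index(keys) for keys in STAGE_FILTER_INDEXES]
```

With a real deployment this would be called once at startup or from a migration script, for example `ensure_indexes(db["studies"])`, where the collection name is again an assumption.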

Phase 3 (annotation-based filtering) will require additional indexes on reconciled annotation values — likely flattened fields (reconciledAnnotations.<questionId>) rather than wildcard indexes, to maintain selectivity.

Random Selection Strategy

For "get next" to select a random study from the filtered pool efficiently, avoid $sample after a large $match pipeline (CPU-heavy scan). Instead, store a precomputed rand field (double in [0, 1)) on each Study document:

r = random()
q1: match(pool & rand >= r) sort(rand ASC) limit(1)
if none: q2: match(pool & rand < r) sort(rand ASC) limit(1)

This uses the rand index for an efficient range scan. The wrap-around query (q2) runs only when no pool study has rand >= r, which is most likely when r is close to 1.0.
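
The two-query strategy can be simulated in memory to check the wrap-around behaviour. This sketch stands in for the MongoDB queries: `pick_random` and the `in_pool` predicate are hypothetical, and each list comprehension plays the role of one `match` with `sort(rand ASC) limit(1)`.

```python
import random
from typing import Any, Callable, Optional


def pick_random(studies: list[dict[str, Any]],
                in_pool: Callable[[dict[str, Any]], bool],
                r: Optional[float] = None) -> Optional[dict[str, Any]]:
    """Return the pool study with the smallest rand >= r, wrapping around if none."""
    if r is None:
        r = random.random()
    pool = [s for s in studies if in_pool(s)]
    # q1: match(pool & rand >= r) sort(rand ASC) limit(1)
    q1 = [s for s in pool if s["rand"] >= r]
    if q1:
        return min(q1, key=lambda s: s["rand"])
    # q2: match(pool & rand < r) sort(rand ASC) limit(1)
    q2 = [s for s in pool if s["rand"] < r]
    return min(q2, key=lambda s: s["rand"]) if q2 else None


studies = [
    {"id": "a", "rand": 0.1},
    {"id": "b", "rand": 0.5},
    {"id": "c", "rand": 0.9},
]
print(pick_random(studies, lambda s: True, r=0.3)["id"])   # b
print(pick_random(studies, lambda s: True, r=0.95)["id"])  # a
```

Note that the choice is only approximately uniform: a study that follows a large gap in rand values is selected more often than one that follows a small gap.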

Known Performance Risks

Risk	Trigger	Mitigation
Multikey fan-out	Many profiles across large corpus → large index entries per study	Pre-filter by `profileId` in the `$elemMatch`; keep profile count per project reasonable
Deep OR trees	Complex Filter Sets with deeply nested OR groups	Simplifier flattens redundant nesting; push most selective branches first in compiled pipeline

Observability

Key metrics to emit for selection and filtering:

select_next.duration_ms — selection latency (basis for p95 target)
select_next.mode — which selection mode was used
select_next.filtered_out_reason — why a study was skipped (suppression / hiddenExcluded / maxInProgress)
filter_compile.duration_ms — Filter Set compilation time
stage_study_pool.count — pool size per stage (for dashboard monitoring)

Collect a 2-week baseline post-launch before setting SLO alert thresholds.
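
A minimal sketch of how a duration metric could be recorded and summarised while the baseline is collected. The in-memory `sink` dictionary stands in for whatever metrics backend is used, and the helper names are hypothetical; the percentile uses the nearest-rank method.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def timed(metric: str, sink: dict[str, list[float]]) -> Iterator[None]:
    """Record the wall-clock duration of the wrapped block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink.setdefault(metric, []).append((time.perf_counter() - start) * 1000.0)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


sink: dict[str, list[float]] = {}
with timed("select_next.duration_ms", sink):
    sum(range(1000))  # placeholder for the real selection call

print(p95([float(ms) for ms in range(1, 101)]))  # 95.0
```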

Success Criteria

select_next p95 < 400 ms, including Filter Set compilation
Stage Study Pool counts are consistent with Filter Set rules (verified by automated tests)
Legacy stages with no Filter Set behave identically to current behaviour
Admin can configure a two-stage pipeline (TA → FT) in ≤ 5 minutes

Related Features

Screening Profiles — Named screening criteria configurations
Screening Annotations — Structured exclusion reasons and reconciliation
Stage Settings — Per-stage configuration policies layered on top of the study pool
Annotation Versioning — Foundational versioning pattern for all entities
Question Management — Question lifecycle from draft through versioning
Reconciliation — Annotation reconciliation workflow and authority determination
Annotation Form V2 — Rebuilt annotation form with per-question auto-save