Stage Filtering
Overview
- Feature Name: Stage Filtering
- Target Users: Project Administrators, Reviewers
- Business Value: Configurable, stage-scoped study pools that enable multi-stage screening pipelines within a single project
- Phase: Phase 1 (with Screening Profiles)
Problem & Solution
SyRF currently applies a single, global screening filter: when a study is excluded, it disappears from the entire project. This works for single-stage reviews but breaks down for multi-stage pipelines. There is no way to configure what a specific stage operates on — every stage sees the same globally-filtered set of studies.
With Screening Profiles introducing multiple sets of criteria per project, a new question arises: how does a stage know which studies are in its scope? If a project has both "TA Criteria" and "FT Criteria", the Full-Text stage should only see studies that were included under TA — not the full project corpus, and not studies excluded under FT by other reviewers.
Stage Filtering answers this with two concepts:
- Stage Study Pool: The reviewer-agnostic set of studies eligible for work in a given stage, defined by the stage's Filter Set: a collection of configurable rules referencing Screening Profile outcomes. This replaces the implicit global filter with explicit, per-stage configuration.
- Selection Subset: The reviewer-aware subset derived at selection time by applying stage policies (`HideExcludedStudiesFromReviewers`, `MaxInProgress`, per-reviewer suppression). These are personalised visibility rules layered on top of the pool; they don't change pool membership.
Filter Set Model
A Filter Set is a JSON structure stored on each stage that defines its Study Pool. The schema supports nested groups and multiple conditions from day one, even though the MVP UI will only expose simple cases.
Schema (v2)

    {
      "version": 2,
      "logic": "AND",
      "rules": [
        {
          "type": "profileOutcome",
          "profileId": "<guid>",
          "op": "in",
          "values": ["Included", "Conflict"]
        }
      ]
    }

Rule Types
| Type | MVP | Description |
|---|---|---|
| `profileOutcome` | Yes | Filter by Screening Outcome for a specific profile. Ops: `in`, `notIn` |
| `annotation` | Phase 3 | Filter by reconciled annotation values. Ops: `in`, `notIn`, `any`, `all` |
Groups combine rules with AND/OR logic and can nest to arbitrary depth. The backend validates and stores the full schema; the MVP UI exposes a single pass-forward rule: "Include studies where Outcome(Profile X) is in {Included, Conflict}".
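To make the group semantics concrete, here is a minimal Python sketch (not the backend implementation) that evaluates a Filter Set against a single study held in memory. Several things in it are assumptions made for illustration: the `outcomes` mapping from profileId to status, the profile IDs `ta`, `ft`, and `pilot`, the `Excluded` status value, and the idea that nested groups carry only `logic` and `rules`. It also assumes a `notIn` rule requires the study to have an outcome for that profile, mirroring what an `$elemMatch` would do.

```python
from typing import Any


def matches(node: dict[str, Any], outcomes: dict[str, str]) -> bool:
    """Return True if one study's outcomes satisfy a Filter Set node.

    `outcomes` maps profileId -> status for a single study (assumed shape).
    """
    if "rules" in node:  # group node: combine children with AND / OR
        results = [matches(child, outcomes) for child in node["rules"]]
        return all(results) if node["logic"] == "AND" else any(results)
    if node["type"] != "profileOutcome":
        raise ValueError(f"unsupported rule type: {node['type']}")
    status = outcomes.get(node["profileId"])
    if status is None:  # assumption: no outcome for the profile never matches
        return False
    if node["op"] == "in":
        return status in node["values"]
    if node["op"] == "notIn":
        return status not in node["values"]
    raise ValueError(f"unsupported operator: {node['op']}")


# Hypothetical nested Filter Set: TA passed AND (FT not excluded OR pilot included)
filter_set = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "ta", "op": "in",
         "values": ["Included", "Conflict"]},
        {"logic": "OR", "rules": [
            {"type": "profileOutcome", "profileId": "ft", "op": "notIn",
             "values": ["Excluded"]},
            {"type": "profileOutcome", "profileId": "pilot", "op": "in",
             "values": ["Included"]},
        ]},
    ],
}
```

With this filter, a study included under `ta` and not excluded under `ft` is in the pool, while a study excluded under `ta` is out regardless of its other outcomes.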
Compilation to MongoDB
The FilterCompiler translates Filter Set JSON into efficient MongoDB queries. Profile outcome rules compile to $elemMatch on the screeningOutcomes array, ensuring conditions match on the same array element (same profile):

    // profileOutcome rule → $elemMatch
    f.ElemMatch("screeningOutcomes", f.And(
        f.Eq("profileId", rule.ProfileId),
        f.In("status", rule.Values)
    ))

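To illustrate the shape of the compiled query, the following Python sketch turns a Filter Set into a MongoDB filter document. It is not the actual FilterCompiler: the `compile_node` name, the mapping of `notIn` to `$nin`, and the mapping of groups to `$and`/`$or` are assumptions. The field names `screeningOutcomes`, `profileId`, and `status` are taken from the snippet above.

```python
from typing import Any

OPS = {"in": "$in", "notIn": "$nin"}      # assumed operator mapping
LOGIC = {"AND": "$and", "OR": "$or"}      # assumed group mapping


def compile_node(node: dict[str, Any]) -> dict[str, Any]:
    """Compile a Filter Set node into a MongoDB filter document (a plain dict)."""
    if "rules" in node:
        # MongoDB rejects an empty $and/$or array, so this relies on
        # validation having already refused empty rule sets.
        return {LOGIC[node["logic"]]: [compile_node(c) for c in node["rules"]]}
    if node["type"] != "profileOutcome":
        raise ValueError(f"unsupported rule type: {node['type']}")
    # Both conditions sit inside one $elemMatch so they must hold
    # on the same screeningOutcomes entry.
    return {
        "screeningOutcomes": {
            "$elemMatch": {
                "profileId": node["profileId"],
                "status": {OPS[node["op"]]: list(node["values"])},
            }
        }
    }


example = {
    "version": 2,
    "logic": "AND",
    "rules": [
        {"type": "profileOutcome", "profileId": "<guid>", "op": "in",
         "values": ["Included", "Conflict"]},
    ],
}
query = compile_node(example)
```

The resulting `query` is an ordinary dictionary that a MongoDB driver would accept as a filter argument.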
A simplifier runs before compilation and is a correctness requirement, not just an optimization. When two profileOutcome rules target the same profileId but are compiled as separate $elemMatch blocks, MongoDB can match them against different array elements — returning studies that satisfy rule A on one outcome entry and rule B on another, rather than requiring both conditions on the same entry. The simplifier prevents this by merging rules per profileId before compilation:
- `AND(in(A), in(B))` for same profileId → `in(A ∩ B)` (single `$elemMatch`)
- `AND(in(A), notIn(B))` for same profileId → `in(A - B)` (single `$elemMatch`)
- Flatten nested groups with the same logic operator
- Remove tautologies and contradictions
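The per-profile merge step can be sketched in Python as below. This is a hedged illustration, not the real simplifier: `merge_and_rules` is a hypothetical name, it handles only the rules of a single AND group, and the `notIn` with `notIn` case (merged to a `notIn` of the union) is an extrapolation that the list above does not spell out.

```python
from typing import Any, Optional


def merge_and_rules(rules: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge profileOutcome rules sharing a profileId within one AND group."""
    allowed: dict[str, Optional[set]] = {}  # profileId -> `in` set (None = no `in` seen)
    denied: dict[str, set] = {}             # profileId -> union of `notIn` values
    order: list[str] = []                   # keep first-seen profile order
    for rule in rules:
        pid = rule["profileId"]
        if pid not in allowed:
            order.append(pid)
            allowed[pid] = None
            denied[pid] = set()
        values = set(rule["values"])
        if rule["op"] == "in":
            # in(A) AND in(B) -> in(A ∩ B)
            allowed[pid] = values if allowed[pid] is None else allowed[pid] & values
        else:
            denied[pid] |= values

    merged = []
    for pid in order:
        if allowed[pid] is None:
            op, values = "notIn", denied[pid]
        else:
            # in(A) AND notIn(B) -> in(A - B); an empty result is a contradiction
            # that a later simplifier pass would have to remove or reject.
            op, values = "in", allowed[pid] - denied[pid]
        merged.append({"type": "profileOutcome", "profileId": pid,
                       "op": op, "values": sorted(values)})
    return merged
```

After merging, each profileId appears in at most one rule, so compilation produces at most one `$elemMatch` per profile in the group.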
Validation
The backend returns 422 for invalid profile IDs, empty rule sets, or unsupported operators. Circular references (stage A filters on stage B's profile, which filters on stage A's profile) are detected at save time.
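A structural validator along these lines might look like the following Python sketch. The `validate` function, its error-message format, and the `known_profile_ids` argument are hypothetical; the sketch covers unknown profile IDs, empty rule sets, and unsupported operators, and leaves circular-reference detection out. A caller would presumably map a non-empty error list to the 422 response.

```python
from typing import Any

SUPPORTED_OPS = {"in", "notIn"}  # MVP operators for profileOutcome rules


def validate(node: dict[str, Any], known_profile_ids: set,
             path: str = "filterSet") -> list[str]:
    """Return a list of validation errors; an empty list means the node is valid."""
    errors: list[str] = []
    if "rules" in node:  # group node
        if node.get("logic") not in ("AND", "OR"):
            errors.append(f"{path}: logic must be AND or OR")
        if not node["rules"]:
            errors.append(f"{path}: rule set is empty")
        for i, child in enumerate(node["rules"]):
            errors += validate(child, known_profile_ids, f"{path}.rules[{i}]")
        return errors
    if node.get("type") != "profileOutcome":
        errors.append(f"{path}: unsupported rule type {node.get('type')!r}")
        return errors
    if node.get("profileId") not in known_profile_ids:
        errors.append(f"{path}: unknown profileId")
    if node.get("op") not in SUPPORTED_OPS:
        errors.append(f"{path}: unsupported operator {node.get('op')!r}")
    return errors
```

Collecting every error with its path, rather than stopping at the first, lets the admin UI highlight all problems in a single save attempt.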
Stage Settings & Selection
Stage Settings control the reviewer experience within a stage. They are configured per-stage by a Project Admin.
Required Settings
- StudySelectionMode: Required at stage creation (no global default). Values: `screening`, `annotation`, `screeningAndAnnotation`, `reconciliation`
- Screening Profile: Which profile governs screening decisions in this stage
Policy Settings
All of the following are selection-time rules; they are never part of the Stage Study Pool:
- HideExcludedStudiesFromReviewers (default: ON): Suppress already-excluded studies from screening selection. Does NOT apply when `StudySelectionMode = reconciliation`, because screening annotation reconcilers need to see excluded studies to reconcile their annotations.
- MaxInProgress: Cap on concurrent in-progress studies per reviewer
- SessionCountTarget: How many candidate annotation sessions are needed before a study is considered fully reviewed for data extraction. This applies to annotation and reconciliation modes only; screening decision counts are governed by the Screening Profile's agreement mode, not by this setting
- SelfReconciliation (default: OFF): Whether a reviewer who screened the study can also be its screening annotation reconciler
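The layering of policies on top of the pool can be sketched in Python as follows. The `Study` and `StagePolicy` shapes, the `selection_subset` name, and the reading of MaxInProgress as "return nothing once the reviewer is at the cap" are all assumptions for illustration. Only the `screening` and `reconciliation` modes are modelled, and SessionCountTarget is left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Study:
    id: str
    excluded: bool = False                       # assumed flag: already excluded
    screened_by: set = field(default_factory=set)  # reviewer IDs who screened it


@dataclass
class StagePolicy:
    mode: str = "screening"
    hide_excluded: bool = True                   # HideExcludedStudiesFromReviewers
    max_in_progress: Optional[int] = None        # MaxInProgress
    self_reconciliation: bool = False            # SelfReconciliation


def selection_subset(pool: list, policy: StagePolicy,
                     reviewer: str, in_progress: int) -> list:
    """Derive the reviewer-aware subset without modifying the pool itself."""
    if policy.max_in_progress is not None and in_progress >= policy.max_in_progress:
        return []  # assumption: a reviewer at the cap gets no new study
    reconciling = policy.mode == "reconciliation"
    subset = []
    for study in pool:
        screened_it = reviewer in study.screened_by
        if reconciling:
            # Excluded studies stay visible; only SelfReconciliation applies.
            if screened_it and not policy.self_reconciliation:
                continue
        else:
            if screened_it:  # per-reviewer suppression
                continue
            if policy.hide_excluded and study.excluded:
                continue
        subset.append(study)
    return subset
```

Because the function builds a new list, pool membership stays the same for every reviewer, and only the derived subset differs.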
Selection Behaviour by Mode
| Mode | "Get next" returns | Pool basis |
|---|---|---|
| `screening` | Next unscreened study from Stage Study Pool, filtered by policies | Stage Study Pool minus per-reviewer suppression |
| `annotation` | Next study needing data extraction annotations | Included studies (Final Screening Outcome = Included) |
| `screeningAndAnnotation` | Next study for combined screening + annotation | Stage Study Pool, reviewer screens then annotates in one session |
| `reconciliation` | Next study eligible for screening annotation reconciliation | Screening Annotation Reconciliation Pool, respecting SelfReconciliation policy |
Reconciliation: Acts, Workflow, and Stage Settings
The screening pipeline involves two distinct reconciliation acts that produce different entities. Only one is implemented at MVP; both are described here for completeness.
| Reconciliation Act | Entity Created | Selection Mode | Implemented |
|---|---|---|---|
| Screening annotation reconciliation | Reconciled Screening (decision + screening annotation) | `reconciliation` | Phase 2 (with Screening Annotations) |
| Data extraction reconciliation | Reconciled Annotation Session (authoritative data extraction answers) | Future — own mode or integration with above | Future |
How each relates to stage settings:
- Screening decision reconciliation is automatic (agreement mode). It doesn't appear as a selection mode because it happens implicitly as screening decisions accumulate. The number of decisions required is governed by the Screening Profile's agreement mode, not by `SessionCountTarget`.
- Screening annotation reconciliation is the `reconciliation` selection mode above: a manual process where a reconciler reviews candidate screenings and creates a Reconciled Screening. `SelfReconciliation` controls whether a reviewer who screened the study can also be its reconciler. `HideExcludedStudiesFromReviewers` does NOT apply in this mode, because reconcilers need to see excluded studies.
- Data extraction reconciliation is a future feature that will need its own selection mode or integration with the `reconciliation` mode. `SessionCountTarget` applies to data extraction: it controls how many candidate annotation sessions are needed before a study is eligible for this reconciliation.
Integrated workflow (future): The two reconciliation acts are always conceptually distinct (different entities, different purposes). However, the user-facing workflow could be integrated into a unified reconciler workbench — shared assignment queues, a single study view covering both screening annotations and data extraction — without merging the underlying data models. This is a UX concern to be addressed when data extraction reconciliation is designed.
API & Integration
Filter Set Management (Admin)
Filter Sets are managed as part of stage configuration. No separate endpoint — they're persisted as a property of the stage document.
- `PUT /api/projects/{projectId}/stages/{stageId}`: Update stage settings including Filter Set
- Validation returns 422 if Filter Set references nonexistent profile IDs, invalid operators, or malformed group structure
- Filter Set changes take effect immediately for subsequent `select_next` calls (no cache)
Stage Study Pool via Studies Endpoint
- `GET /api/projects/{projectId}/studies?stageId={stageId}`: Returns the Stage Study Pool (reviewer-agnostic)
- When `stageId` is provided, the stage's Filter Set is compiled to a MongoDB query and applied
- When `stageId` is omitted, returns all project studies (existing behaviour preserved)
- No per-reviewer suppression applied; this is the raw pool for admin review and stats
Selection (Reviewer-Facing)
- `POST /api/projects/{projectId}/stages/{stageId}/studies/{studyId}/review`: Submit screening decision (existing pattern, unchanged)
- Selection logic (get-next) applies the Stage Study Pool first, then layers policies (HideExcluded, MaxInProgress, per-reviewer suppression) at selection time
- When no eligible study remains, the get-next endpoint returns 204 No Content; the UI should handle this as "no more studies available" rather than an error
Stats
- `GET /api/projects/{projectId}/stages/{stageId}/stats`: Returns counts scoped to the Stage Study Pool:
  - `AvailableForScreening`, `AvailableForAnnotation`, `ReconciliationEligible`
  - `InProgress` (caller), `Completed` (caller), `ReconciliationInProgress` (caller)
Important scoping distinction: Screening stats (screened count, pending, conflicts) are inherently profile-scoped — a study's screening outcome belongs to its Screening Profile regardless of which stage uses it. The stage stats endpoint resolves the stage's assigned profile and reports screening stats for that profile, but the underlying data is not stage-owned. Annotation and data extraction stats are genuinely stage-scoped. See Screening Profiles — Stats Are Profile-Scoped.
Legacy Compatibility
- Stages without a Filter Set are treated as having an implicit empty filter, so their pool is all project studies (current behaviour)
- No migration needed — existing stages continue working as-is
- Filter Sets are purely additive; admins opt in when they configure multi-stage pipelines
Performance & Indexing
Required Indexes
The following compound indexes support the Filter Set compilation and selection queries:
| Index | Purpose |
|---|---|
| `screeningOutcomes.profileId` + `screeningOutcomes.status` | `$elemMatch` for profileOutcome rules |
| `screeningOutcomes.profileId` + `screeningOutcomes.decisions.reviewerId` | Per-reviewer suppression in selection |
| Study-level `rand` field (ascending) | Efficient random selection (see below) |
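As a rough illustration, the three indexes in the table could be written as key specifications in the list-of-`(field, direction)` form that drivers such as PyMongo accept for `create_index`. The names `STUDY_INDEXES` and `ensure_indexes` are hypothetical, ascending order for the two compound indexes is an assumption, and the sketch takes the collection as a parameter so that it does not depend on a live database.

```python
ASCENDING = 1  # same numeric value as pymongo.ASCENDING

# One entry per row of the index table above.
STUDY_INDEXES = [
    # $elemMatch for profileOutcome rules
    [("screeningOutcomes.profileId", ASCENDING),
     ("screeningOutcomes.status", ASCENDING)],
    # per-reviewer suppression in selection
    [("screeningOutcomes.profileId", ASCENDING),
     ("screeningOutcomes.decisions.reviewerId", ASCENDING)],
    # precomputed random key for "get next"
    [("rand", ASCENDING)],
]


def ensure_indexes(collection) -> list:
    """Create each index on a studies collection and return the index names.

    `collection` is expected to behave like a pymongo Collection, whose
    create_index(keys) returns the name of the created index.
    """
    return [collection.create_index(keys) for keys in STUDY_INDEXES]
```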
Phase 3 (annotation-based filtering) will require additional indexes on reconciled annotation values — likely flattened fields (reconciledAnnotations.<questionId>) rather than wildcard indexes, to maintain selectivity.
Random Selection Strategy
For "get next" to select a random study from the filtered pool efficiently, avoid $sample after a large $match pipeline (CPU-heavy scan). Instead, store a precomputed rand field (double in [0, 1)) on each Study document:

    r = random()
    q1: match(pool & rand >= r) sort(rand ASC) limit(1)
    if none:
        q2: match(pool & rand < r) sort(rand ASC) limit(1)

This uses the rand index for an efficient range scan. The wrap-around query (q2) handles the case where no study in the pool has rand >= r, which is most likely when r is near 1.0 or the pool is small.
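The two-query strategy can be simulated in memory with a sorted list standing in for the `rand` index. This is a sketch under stated assumptions: `pick_next` is a hypothetical name, and `pool` is assumed to hold `(rand, study_id)` tuples for the already-filtered pool, sorted ascending by `rand`.

```python
import bisect
import random
from typing import Optional


def pick_next(pool: list, r: Optional[float] = None) -> Optional[str]:
    """Pick a study ID from `pool`, a list of (rand, study_id) sorted by rand."""
    if not pool:
        return None  # nothing eligible
    if r is None:
        r = random.random()  # double in [0, 1)
    # q1: first study with rand >= r. The 1-tuple (r,) sorts before any
    # (r, study_id), so bisect_left lands on the first entry with rand >= r.
    i = bisect.bisect_left(pool, (r,))
    if i < len(pool):
        return pool[i][1]
    # q2 (wrap-around): every rand is < r, so take the smallest one.
    return pool[0][1]


studies = [(0.1, "a"), (0.4, "b"), (0.9, "c")]
```

Note that this approach is not perfectly uniform: a study whose `rand` value follows a large gap is chosen more often than one that follows a small gap, which is usually an acceptable trade-off for avoiding a scan.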
Known Performance Risks
| Risk | Trigger | Mitigation |
|---|---|---|
| Multikey fan-out | Many profiles across large corpus → large index entries per study | Pre-filter by profileId in the $elemMatch; keep profile count per project reasonable |
| Deep OR trees | Complex Filter Sets with deeply nested OR groups | Simplifier flattens redundant nesting; push most selective branches first in compiled pipeline |
Observability
Key metrics to emit for selection and filtering:
- `select_next.duration_ms`: selection latency (basis for p95 target)
- `select_next.mode`: which selection mode was used
- `select_next.filtered_out_reason`: why a study was skipped (suppression / hiddenExcluded / maxInProgress)
- `filter_compile.duration_ms`: Filter Set compilation time
- `stage_study_pool.count`: pool size per stage (for dashboard monitoring)
Collect a 2-week baseline post-launch before setting SLO alert thresholds.
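For the baseline itself, a minimal Python sketch of recording `select_next.duration_ms` samples and computing their p95 is shown below. The `timed` helper and the plain list used as a sample store are stand-ins for whatever metrics backend is actually in place, and the nearest-rank percentile definition is an assumption.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(samples: list):
    """Append the elapsed time of the wrapped block, in milliseconds, to `samples`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        samples.append((time.perf_counter() - start) * 1000.0)


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer maths
    return ordered[rank - 1]


# Usage: wrap each selection call, then compare the percentile to the target.
select_next_duration_ms: list = []
with timed(select_next_duration_ms):
    sum(range(1000))  # placeholder for the real selection call
```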
Success Criteria
- `select_next` p95 < 400 ms with Filter Set compilation
- Stage Study Pool counts are consistent with Filter Set rules (verified by automated tests)
- Legacy stages with no Filter Set behave identically to current behaviour
- Admin can configure a two-stage pipeline (TA → FT) in ≤ 5 minutes
Related Documents
- Screening Profiles — Named screening criteria configurations
- Screening Annotations — Structured exclusion reasons and reconciliation
- Stage Settings — Per-stage configuration policies layered on top of the study pool
- Annotation Versioning — Foundational versioning pattern for all entities
- Question Management — Question lifecycle from draft through versioning
- Reconciliation — Annotation reconciliation workflow and authority determination
- Annotation Form V2 — Rebuilt annotation form with per-question auto-save