PRISMA Specification & Data Model Constraints

Summary

Phase 2 established the blueprint that every future phase must follow. We documented exactly how SyRF's data must be structured to produce PRISMA 2020 flow diagrams -- the standardised reporting format that systematic reviews are required to include. This specification prevents us from building features that would later break our ability to generate these reports.

The Problem

PRISMA 2020 flow diagrams require specific data about how studies were identified, screened, and included in a review. If we build our features without thinking about PRISMA from the start, we risk creating data structures that cannot answer the questions PRISMA asks -- for example, "how many records were identified from each database?" or "how many duplicates were removed before screening?" Retrofitting PRISMA support after the fact would require painful data migrations and could compromise data integrity.

What We Built

We produced five specification documents that constrain all future development:

PRISMA box-to-field mapping -- For each of the 17 boxes in a PRISMA flow diagram (covering 34 data fields), we defined exactly which data in SyRF produces that number. This means that when we build PRISMA reporting in the final release, every required number will already be computable from the data collected along the way.
Three-level data model -- We introduced three distinct layers for organising citation data:
Reference: A global identity for a piece of research (like a library catalogue entry that exists independently of any project)
Import Record: A permanent record of exactly what was imported from a specific search (like a receipt -- never changed after creation)
Study: The working entity that reviewers interact with -- screening, annotating, and reconciling

This separation means SyRF can correctly count "records identified" versus "studies included" -- a distinction PRISMA requires but our current system cannot make.
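
To make the three layers concrete, here is a minimal Python sketch. It is illustrative only and is not SyRF's actual schema: every class name, field name, and identifier below is hypothetical, and the link from a Study to a Reference is an assumption about how the layers relate. It shows why keeping import records separate from studies lets the two PRISMA counts differ.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Reference:
    """Global identity for a piece of research, independent of any project."""
    reference_id: str
    title: str


@dataclass(frozen=True)
class ImportRecord:
    """Receipt of what one search import delivered; frozen so it cannot change."""
    record_id: str
    search_id: str
    reference_id: str
    raw_citation: str


@dataclass
class Study:
    """Working entity that reviewers screen, annotate, and reconcile."""
    study_id: str
    project_id: str
    reference_id: str  # assumption: each study points at one global reference
    included: bool = False


# Hypothetical data: the same paper arrives from two searches.
references = [Reference("ref-1", "Paper A"), Reference("ref-2", "Paper B")]
import_records = [
    ImportRecord("rec-1", "search-pubmed", "ref-1", "Paper A as exported by search 1"),
    ImportRecord("rec-2", "search-embase", "ref-1", "Paper A as exported by search 2"),
    ImportRecord("rec-3", "search-pubmed", "ref-2", "Paper B as exported by search 1"),
]
studies = [
    Study("study-1", "proj-1", "ref-1", included=True),
    Study("study-2", "proj-1", "ref-2", included=False),
]

# "Records identified" counts import records, not studies.
records_identified = len(import_records)
# Distinct references behind those records (two here, because Paper A was imported twice).
distinct_references = len({r.reference_id for r in import_records})
# "Studies included" counts working entities that reviewers accepted.
studies_included = sum(1 for s in studies if s.included)

print(records_identified, distinct_references, studies_included)
```

In this sketch, marking `ImportRecord` as frozen mirrors the "never changed after creation" rule: assigning to one of its fields raises `FrozenInstanceError`, while a `Study` stays mutable because reviewers keep working on it.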

Study lifecycle model -- We defined nine states a study can move through (Active, Duplicate, Pending Review, Included, etc.), with clear rules for which PRISMA box each state feeds into.
Source type taxonomy -- We categorised where searches come from (databases like PubMed, registers like ClinicalTrials.gov, websites, organisations, citation searching, other) so PRISMA's dual-column layout can be automatically populated.
Phase-by-phase constraint annotations -- For every future phase (3 through 16), we documented which PRISMA rules it must follow, plus validation checklists for each release.
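
The lifecycle model and source type taxonomy above can be sketched as simple lookups that feed PRISMA counts. This is a hedged illustration, not the specification itself. The enum and function names are hypothetical, and the state-to-box mapping is an assumption, since this summary does not spell it out. Only the four lifecycle states named above appear, because the remaining five are not listed here. Placing "other" sources in the other-methods column is also an assumption.

```python
from collections import Counter
from enum import Enum


class SourceType(Enum):
    DATABASE = "database"
    REGISTER = "register"
    WEBSITE = "website"
    ORGANISATION = "organisation"
    CITATION_SEARCHING = "citation searching"
    OTHER = "other"


# Column headings of the PRISMA 2020 dual-column flow diagram.
DATABASES_AND_REGISTERS = "Identification of studies via databases and registers"
OTHER_METHODS = "Identification of studies via other methods"


def prisma_column(source_type: SourceType) -> str:
    """Return the PRISMA column a search source feeds into."""
    if source_type in (SourceType.DATABASE, SourceType.REGISTER):
        return DATABASES_AND_REGISTERS
    return OTHER_METHODS  # assumption: everything else, including OTHER


class StudyState(Enum):
    # Only the states named in the text; the full model defines nine.
    ACTIVE = "Active"
    DUPLICATE = "Duplicate"
    PENDING_REVIEW = "Pending Review"
    INCLUDED = "Included"


# Assumed mapping from lifecycle state to PRISMA box label.
# States without an entry here are simply not counted in this sketch.
STATE_TO_BOX = {
    StudyState.DUPLICATE: "Duplicate records removed",
    StudyState.INCLUDED: "Studies included in review",
}


def tally_columns(source_types):
    """Count imported records per PRISMA column."""
    return Counter(prisma_column(s) for s in source_types)


def tally_boxes(states):
    """Count studies per PRISMA box, skipping states with no assumed box."""
    return Counter(STATE_TO_BOX[s] for s in states if s in STATE_TO_BOX)


# Hypothetical sample: one source type per imported record, one state per study.
record_sources = [
    SourceType.DATABASE,
    SourceType.DATABASE,
    SourceType.REGISTER,
    SourceType.WEBSITE,
    SourceType.CITATION_SEARCHING,
]
study_states = [
    StudyState.ACTIVE,
    StudyState.DUPLICATE,
    StudyState.DUPLICATE,
    StudyState.PENDING_REVIEW,
    StudyState.INCLUDED,
]

column_counts = tally_columns(record_sources)
box_counts = tally_boxes(study_states)
print(column_counts)
print(box_counts)
```

Keeping both mappings as plain data rather than scattered conditionals is one possible design choice: a later phase could then check its own states and source types against a single table.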

Why This Matters

The PRISMA 2020 flow diagram is not optional for systematic reviews -- it is the standard reporting format expected by journals, funders, and regulatory bodies. By defining the data requirements before any code is written, we ensure that SyRF will be able to auto-generate compliant reports from the data it naturally collects during a review.

This is a significant competitive advantage: many systematic review tools still rely on manual report assembly, whereas SyRF will generate PRISMA diagrams directly from the review data.

How It Connects

Phase 2 is the gate that every subsequent phase must pass through. Before building anything in Phases 3-16, engineers consult these specifications to ensure their work supports PRISMA compliance. The three-level data model informs how we structure data in Phase 3 (Collection Infrastructure), how we handle deduplication in Phase 12, and how we generate PRISMA reports in Phase 16. Without this phase, individual teams would make data structure decisions in isolation, and we would discover incompatibilities only when trying to generate the final reports.