DanCorwin/ClinicalTrialMatchingRequirements

DRAFT Requirements for a POC User Interface

The needs of Patient Recruitment (see "Use Case Scenarios") are documented. The "COI Architecture" suggests they be supplied by interacting web services for (1) query forms, (2) a generic API for EMR systems, and (3) a semantic search engine for files used by (1) and (2).

Feedback of all sorts is welcome on this DRAFT spec, which imposes implied requirements on each module above. Hoping to reconcile these with "scoping" objectives in POC project plans, I have split up the UI functionality into 3 staged levels of service that can be added sequentially:

A) Query Formation - map a protocol into a "target" file for EMRwrappers
B) Manage Targets - Save/run these files; find those using any predicate(s)
C) Get Candidates - Find "matching" patients; set up User-alerting services

For a POC we need only level [A] code plus reasonable specs for [B] and [C]. Deployable beta code at level [B], however, should be customizable to real EMR sites. Finding willing beta testers probably takes a promise of level [C] functionality, conditional on successful QA tests, plus extra use cases such as Knowledge Acquisition and "usage spec" lexicons for all major corpora.

DagServer engine (1 and 2 below)

This service offers advanced search and graph-matching across indexed corpora, plus metadata storage and rule-based filtering of (e.g.) matching candidates for a target. As a background service, it is not user visible, but it helps all levels by providing a framework for saving and applying "Clinical Trials" predicates to the "Main Patient Index" (MPI) of a caregiver.

CORPORA (what data collections must get stored?)

1. A web-based repository will exist for documents posted to various corpora, able to return named graphs of 1, 2, or 3 sorts, depending on the UI "level":

A) predicates - a quasi-standard tree of documented constraint predicates (8)
- their descriptors come from standards that COI-groups want to relate
- they get "bridged" by English spellings naming EMRwrapper URLs
B) targets - prototype patient models registered by some COI Agency end-user
- each has basic "(+)" documentation that models its protocol URL
- complex constraint predicates can get added by user via web forms
C) candidates - Patient summaries in corpora from a real (or simulated) MPI
- these models are periodically exported from some EMR source (4)
- each need only offer basic "(+)" descriptors and an ID in local EMR

MATCHING (how are result-lists of matches obtained?)

2. A secure HTTP query API must exist for (1), by which a registered user can get back all record IDs that "match" a given graph. The power to cross-query any corpus by using indexed graphs in another corpus is what enables "recruitment" matches. It gets increasingly useful as UI levels increase and in other use cases. In all cases, results meet these additional criteria:

they exist in a user-specified list of corpora
they changed in a specific time period (default=since I last looked)

COI Agency (3 through 7)

This web-based User Interface mediates between DAGserver and EMRwrapper and controls all matching logic. At higher levels, it also offers user training services, aids system knowledge acquisition, and helps with fine-grained target management. At all levels, however, it lets a clerical user compose targets by searching for their protocol-directed predicates, documented and indexed by using each "domain ontology" of potential interest (HL7, DCM, SDTM).

MATCHING (how are matches sought and processed?)

3. The core matching process works from concepts extracted from the semi-structured English text used to model the predicates in some domain ontology. These concepts may also model textual constraints on patients found in some protocol texts (such as those found at http://clinicaltrials.gov), and they get matched using (2).

Standard predicates for COI Agency or EMRwrapper constrain all matches
- They are implemented within user-installed predicates
- They are annotated and triggered via standard UI vocabularies (1A)
UMLS is a site-wide "bridge" ontology, with subsets [TBDL] at higher levels
- At any level, the simplest desired subset to find likely involves HL7
- DCM/SDTM concepts must be registered within Agency lexicons
Aids to matching [stop words, AND/OR/NOT/IF forms, etc] are expected
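
As one possible reading of the "aids to matching" above, concept extraction with stop words might look like the sketch below. The stop-word list, the tokenizing rule, and the subset test are all assumptions made for illustration. The operator words AND/OR/NOT/IF are deliberately kept out of the stop list, since they are expected to carry meaning.

```python
import re
from typing import Set

# Illustrative stop words only; a real list would be site-configurable.
STOP_WORDS = frozenset({"a", "an", "the", "of", "with", "in", "for", "to", "is"})


def extract_concepts(text: str) -> Set[str]:
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tok for tok in tokens if tok not in STOP_WORDS}


def text_matches(target_text: str, candidate_text: str) -> bool:
    """True when every concept of the target also occurs in the candidate."""
    return extract_concepts(target_text) <= extract_concepts(candidate_text)
```

For example, `text_matches("type 2 diabetes", "Patient with a history of Type 2 Diabetes")` holds, while the same candidate text does not match "type 1 diabetes". Real matching would go through the indexed graphs of (2) rather than raw token sets.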

IMPORTING (how can records be added or changed?)

4. A secure import API will exist for (1), into which a registered user or process can HTTP-post a report on corpus changes (new or adjusted records, with their IDs):

such posted data must be in a supported triples-based format [list TBDL]
each posted triple must be expressible in the ontology supported by (2)
at level [C], service (5) must pull candidate exports from service (8)

5. A browser-based User Interface will drive APIs (2) and (4) to best operational advantage, according to the type and preferences of its registered user. It will:

help a user specify inputs compatible with his or her query rights
help a user specify a time range for matches (default: "since I last looked")
return a list of matching records; in [C], it also actively aids filtering by (8)

MAPPING (how can optional foreign ontologies be supported?)

6. To enable imports in corpus-specific vocabulary, we can support an ontology mapping process into (3) that uses corpus-specific extension lexicons:

each such lexicon may (re)define terms in a site-local vocabulary
they expand into string patterns of (3) before imports go into (4)

7. To simplify management of corpus-specific extension lexicons, (6) also supports an interactive machine-learning capability for them, by which:

some unknown terms in an import record get logged as serious user errors
a willing user may repair a bad spelling and restart the import OR ..
the user may instead add the spelling to a corpus lexicon, then restart
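
The lexicon flow of (6) and (7) could be sketched roughly as follows. The function name, the dictionary-based lexicon, and the sample terms are hypothetical; the actual lexicon format is not specified here.

```python
from typing import Dict, Iterable, List, Set, Tuple


def expand_terms(
    terms: Iterable[str],
    core_vocab: Set[str],
    lexicon: Dict[str, List[str]],
) -> Tuple[List[str], List[str]]:
    """Map the terms of an import record into the core vocabulary of (3).

    Terms already in the core vocabulary pass through unchanged, terms found
    in the corpus-specific lexicon are replaced by their expansion, and
    anything else is reported as unknown so the import can be stopped.
    """
    expanded: List[str] = []
    unknown: List[str] = []
    for term in terms:
        if term in core_vocab:
            expanded.append(term)
        elif term in lexicon:
            expanded.extend(lexicon[term])
        else:
            unknown.append(term)
    return expanded, unknown


# Hypothetical usage: the first pass logs "diabetis" as an unknown term.
core = {"diabetes", "hypertension"}
site_lexicon = {"htn": ["hypertension"]}
expanded, unknown = expand_terms(["diabetes", "htn", "diabetis"], core, site_lexicon)

# The user chooses to add the spelling to the corpus lexicon, then restarts.
site_lexicon["diabetis"] = ["diabetes"]
expanded_retry, unknown_retry = expand_terms(["diabetes", "htn", "diabetis"], core, site_lexicon)
```

The alternative path in (7), repairing the spelling in the import record itself, would simply rerun `expand_terms` on the corrected terms without touching the lexicon.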

EMRwrapper (8 and 9 below)

This "back end" should provide various output files that report information on its current implementation status and/or its evaluation of "target" files for EMR data in some real or simulated MPI.

MATCHING (how are matches found or filtered by constraints?)

8. To ease future growth, I/O designs must persistently focus on a "target" file citing scripted constraints. It must be checked against potential patient "matches", and the results returned in a file for inspection by a user.

A) Each patient-property test being implemented must be globally documented by standards (1A). The UI team needs such documentation in ASCII files that separately show its name(s) and an English description for each ontology being mapped, plus a brief English summary of each patient-property, suitable for inclusion in some HTML pull-down list.
B) Must apply a target file to each patient in a list and return a list of T/F evaluations for each constraint tested. The evaluations cover several real or simulated EMRs in a sample and honor these special features:
- Any predicate not yet implemented should return an error message
- Constraints may decompose into tests linked by AND/OR/IF (3C)
- Weakening criteria can be handled by adjusting some IF-test level
C) Similar to B, except additional requests let a caller cite specific patient IDs to be tested, plus find all (anonymized?) patient IDs meeting various conditions [details TBDL]
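
A minimal sketch of the level [B] behavior, assuming a target is a list of constraints and each constraint is either a predicate name or a nested ("AND" | "OR", ...) tuple. The predicate names, the patient fields, and the error-string format are invented for illustration; IF forms and weakening are left out.

```python
from typing import Any, Callable, Dict, List, Union

Patient = Dict[str, Any]
Result = Union[bool, str]  # True/False, or an error message

# Hypothetical locally installed predicates.
PREDICATES: Dict[str, Callable[[Patient], bool]] = {
    "is_adult": lambda p: p.get("age", 0) >= 18,
    "has_diabetes": lambda p: "diabetes" in p.get("conditions", []),
}


def evaluate(constraint: Any, patient: Patient) -> Result:
    """Evaluate one constraint against one patient."""
    if isinstance(constraint, str):
        test = PREDICATES.get(constraint)
        if test is None:
            # A predicate not yet implemented yields an error message.
            return f"ERROR: predicate '{constraint}' not implemented"
        return test(patient)
    op, *parts = constraint
    results = [evaluate(part, patient) for part in parts]
    errors = [r for r in results if isinstance(r, str)]
    if errors:
        return errors[0]  # propagate the first error upward
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    return f"ERROR: unknown operator '{op}'"


def apply_target(target: List[Any], patients: List[Patient]) -> Dict[str, List[Result]]:
    """Return, per patient ID, one evaluation for each constraint in the target."""
    return {p["id"]: [evaluate(c, p) for c in target] for p in patients}
```

Level [C] would add a way to restrict `patients` to caller-cited IDs before calling `apply_target`; the details of that request are still TBDL.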

IMPLEMENTATION (how are the output files actually produced?)

9. Little specific is required, but guidelines are suggested below.

A) At this level, LSW or something similar is suggested, with any convenient set of patient records. The web API is optional here and in [B]. Files of documentation may come over any convenient channel for indexing, including email.
B) The above must now be bundled into a download that EMR system staff can easily install locally and then regularly extend with new predicates as needed. The documentation from level A must now be flagged to show when (or if) each standard property predicate was actually installed locally.
C) Here we need a web API that reacts ONLY to pre-selected IP addresses. It lets complex constraints on patients be checked by ANY program having secure web access, coded in ANY language, and running on ANY platform - which is excellent exposure. Recurring tests suggest a regular data feed of EMR changes at each caregiver, which can drive constraint checking as updates occur.
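
The "pre-selected IP addresses" rule at level [C] might be enforced with a check like the one below, using Python's standard `ipaddress` module. The allowlist entries are reserved documentation addresses chosen only for illustration, and `is_allowed` is a hypothetical helper; an address check alone does not make the API secure and would sit alongside authentication and HTTPS.

```python
import ipaddress
from typing import Sequence, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

# Placeholder allowlist; a real deployment would load this from site config.
ALLOWED_NETWORKS: Sequence[Network] = (
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
)


def is_allowed(remote_addr: str, allowed: Sequence[Network] = ALLOWED_NETWORKS) -> bool:
    """True only if the caller's address falls inside a pre-selected network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # A malformed address is rejected rather than raising.
        return False
    return any(addr in net for net in allowed)
```

The web API would call `is_allowed` with the remote address of each request and refuse the request before any target file is evaluated.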