STREAM (ChemBio)

A Standard for Transparently Reporting Evaluations in AI Model Reports


Introduction

What is STREAM (ChemBio)? STREAM is intended to be a clear, standardized reporting framework that details the key information we view as necessary for third parties to understand and interpret the results of dangerous capability evaluations. Note that STREAM addresses the quality of reporting, not the quality of the underlying evaluations or any resulting risk interpretations.

It is designed to be both a practical resource and an assessment tool: companies and evaluators can use this standard to structure their model reports, and third parties can refer to the standard when assessing such reports. Because the science of evaluation is still developing, we intend for this to be an evolving standard which we update and adapt over time as best practices emerge. We thus refer to the standard in this paper as “version 1”. We invite researchers, practitioners, and regulators to use and improve upon STREAM.

Why is STREAM (ChemBio) needed? Powerful AI systems could provide great benefits to society, but may also bring large-scale risks (Bengio et al., 2025; OECD, 2024), such as misuse of these systems by malicious actors (Mouton et al., 2024; Brundage et al., 2018). In response, many leading AI companies have committed to regularly testing their systems for dangerous capabilities, including capabilities related to chemical and biological misuse (METR, 2025; FMF, 2024b). These tests are often referred to as “dangerous capability evaluations”, and they are a key component of AI companies’ Frontier AI Safety Policies (FSPs), voluntary commitments to the US White House (White House, 2023), and the EU General-Purpose AI Code of Practice (European Commission, 2025).

Despite their importance, there are currently no widely used standards for documenting dangerous capability evaluations clearly alongside model deployments (Weidinger et al., 2025; Paskov, 2025b). Several leading AI companies and AI Safety/Security Institutes regularly publish dangerous capability evaluation results in “model reports” (also called “model cards”, “system cards” or “safety cards”), and cite those results to support important claims about a model’s level of risk (Mitchell et al., 2019). But there is little consistency across such model reports on the evaluation details they provide.

In particular, many model reports lack sufficient information about how their evaluations were conducted, the output of the evaluations, and how the results informed judgments of a model’s potentially dangerous capabilities (Reuel et al., 2024a; Righetti, 2024; Bowen et al., 2025). This limits how informative and credible any resulting claims can be to readers, and impedes third-party attempts to replicate such results.

What is NOT covered by STREAM (ChemBio)? We focus specifically on benchmark evaluations related to chemical and biological (ChemBio) capabilities. Benchmark evaluations are common in model reports and are methodologically distinct from other types of evaluation (e.g. human uplift studies, red-teaming), while still sharing many considerations with them. Misuse of chemical and biological agents is well-studied in national security and law enforcement contexts (NASEM, 2018; NRC, 2008), and there appears to be more consensus on frontier AI capability thresholds for this topic than for many others (FMF, 2025b). However, many considerations in this reporting standard also apply to AI evaluation in other domains, such as cybersecurity or AI self-improvement, and it would be relatively straightforward to extend the standard to cover such domains.

Additionally, we intend for this version of STREAM to be applied to individual evaluations that are reported in a single document (e.g. a model report). For a model report to comply with a criterion, the information required by that criterion must be explicitly included in this document; information reported elsewhere but not reproduced in the model report will not count toward compliance. For example, even if the authors of a benchmark include human baselines in the original benchmark paper, a company’s model report that omits these when reporting on this benchmark will not be in compliance with the Baseline Results criteria (5.5). This is because model reports should provide third parties with one clear compilation of the evidence, providing a more consistent picture to third-party readers and ensuring that important information does not “slip through the cracks”.
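
To make the grading scheme concrete, the sketch below shows one hypothetical way an assessor might tally per-criterion grades for a single evaluation, using the 0.5-point "at a minimum" and 1-point "full credit" values defined in the tables that follow. The criterion keys and the aggregation into a single percentage are our illustration, not something the standard prescribes.

    # Hypothetical tally of STREAM grades for one evaluation in a model report.
    # Each criterion is graded 0 (not reported), 0.5 ("at a minimum"), or 1 ("full credit").
    grades = {
        "threat_relevance.i": 1.0,
        "threat_relevance.ii": 0.5,
        "threat_relevance.iii": 0.0,
        "test_construction.i": 1.0,
        # ... remaining criteria omitted for brevity
    }

    total = sum(grades.values())
    maximum = float(len(grades))
    print(f"STREAM coverage: {total:.1f} / {maximum:.1f} points "
          f"({100 * total / maximum:.0f}% of the criteria assessed)")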

STREAM (ChemBio)

 

Threat Relevance

Criterion | At a minimum (0.5 pt) | Full credit (1 pt)
(i) Capability being measured and relevance to threat model
  • Describe the capability the evaluation measures
  • Describe the threat model the evaluation informs*
  • State the type of actor in that threat model*
  • State the misuse vector in that threat model*
  • State the enabling AI capability of concern
  • Give a brief justification of why the evaluation is a suitable measure (this could be an explanation of why the skills measured by the evaluation matter)§
  • Where applicable, state whether this evaluation is a "rule-in" or "rule-out" evaluation for this capability, and what performance threshold is used to determine this.§
(ii) Strength & nature of evidence provided by evaluation and performance threshold
  • Say whether a score could be taken as strong evidence that the model lacks or possesses the capability of concern, OR state that the evaluation is not core to the safety assessment. (Note that a rule-in or rule-out threshold counts as a judgement of the strength of evidence an eval provides.)
  • Specify whether the eval has a quantitative threshold that rules out or rules in a capability
  • State what those quantitative thresholds are (eg a particular x% score on the eval or exceeding a human baseline)§
  • If a threshold is set, give a brief justification for those numbers§
  • State whether these thresholds were defined before or after the evaluation was run.§
  • Where applicable, if an evaluation was created by an external party and the interpretation of performance thresholds differs from that of the original evaluation designer, report that this is the case.§
(iii) Example item & answer
  • Provide at least one example test item (it may be redacted, as long as it still contains enough detail to illustrate task complexity)
  • Provide a sample answer to that item (redaction permitted, as above)
  • State whether the example is representative of test difficulty
  • If (and only if) the example is not representative of test difficulty, explain the limitations of the example.

* This can be stated just once in the model card for a suite of evaluations – note that, very frequently, a model card will simply identify the core threat model(s) once in an introductory section to the CBRN evaluations as a whole. This is fine and should still be awarded points.

§ This is not necessary if evaluators explicitly disclose in the report that the evaluation is not a significant contributor to the model safety assessment. In that case, the “minimum” is sufficient for full credit. Additionally, these full-credit requirements can automatically be counted as met if, instead of labelling individual evals as rule-in or rule-out, both of the following conditions are met:

  1. the evaluation report elsewhere explains that no single evaluation is capable of ruling a capability in or out*; and
  2. the evaluation report explains at one point the overall configuration of evidence that would either “rule in” or “rule out” capabilities, as required by 5.6(i) and 5.6(ii).
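
As an illustration of the threshold reporting asked for in 5.1(ii), the following minimal sketch records how one evaluation's score maps onto evidence about a capability. The evaluation name, the 80% threshold, and the field names are hypothetical; STREAM does not prescribe any particular representation.

    from dataclasses import dataclass
    from typing import Literal, Optional

    @dataclass
    class EvalThreshold:
        """Hypothetical record of how one evaluation's score is interpreted."""
        name: str
        role: Literal["rule_in", "rule_out", "supporting"]  # how the eval is used
        threshold: Optional[float]          # e.g. fraction correct; None if not core
        defined_before_run: Optional[bool]  # was the threshold set before testing?

    def interpret(ev: EvalThreshold, score: float) -> str:
        """Map a score onto the evidence the evaluation is claimed to provide."""
        if ev.role == "supporting" or ev.threshold is None:
            return "not core to the safety assessment"
        if ev.role == "rule_out":
            return "capability ruled out" if score < ev.threshold else "cannot rule out capability"
        return "capability ruled in" if score >= ev.threshold else "does not rule in capability"

    # Example: a hypothetical rule-out evaluation with a pre-registered 80% threshold.
    ev = EvalThreshold("wet-lab troubleshooting QA", "rule_out", 0.80, True)
    print(interpret(ev, 0.62))  # -> "capability ruled out"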

Test Construction

Criterion | At a minimum (0.5 pt) | Full credit (1 pt)
(i) Number of items tested & relation to full test set
  • State the number of items the model was tested on (or if it is a long-form task, the number of independently evaluated stages of the task in the test)
  • If only a subset was used, then state the full test-set size
  • If only a subset of the test was used, describe how the subset was chosen
(ii) Answer format & scoring
  • Describe required answer format(s), such as whether the eval consisted of multiple-choice, short-answer, or open-ended generative tasks
  • Where applicable, flag any important details of scoring that would not be obvious to readers, and would be needed for replication.
  • If the evaluation was designed by a third party AND if (and only if) any changes were made to the designer's recommended scoring or testing methodology, reports must explicitly acknowledge such differences and provide a brief justification for them.
(iii) Answer-key / rubric creation and quality assurance
  • Describe how the answer keys or grading rubrics / criteria were developed (specifically, how they were created - not how the grading itself was done)
  • If the answer keys were developed by experts, report their qualifications
  • If the answer keys were developed by experts, report the number of experts involved
  • Describe the process by which validation or quality control of the answer key was performed, or explicitly state that this was not done.
  • Where applicable, explain how ambiguous items were handled.
(iv-a) Human-graded: grader sample and recruiting details
  • Give the graders' domain qualifications
  • State the number of graders
  • Describe how graders were recruited
  • State any training provided, or state that there was no such training
(iv-b) Human-graded: details of grading process and grade aggregation
  • Describe grading instructions
  • State whether graders saw both LLM and human answers
  • State whether graders were blinded
  • Explain how grader disagreements were resolved (simple average, majority vote, intervention of senior experts, etc)
  • State the number of independent grades per item
  • Give typical time spent per question
(iv-c) Human-graded: level of grader agreement
  • State if grader agreement was high or low (or some other qualitative assessment of grader agreement)
  • Provide an agreement statistic (eg, Cohen's kappa; a computation sketch follows this section's criteria)
  • Flag disagreements that affect safety conclusions OR explicitly state that there were no such disagreements.
(v-a) Auto-graded: grader model used and modifications applied to it
  • Name the base model used as autograder
  • State whether the model was fine-tuned or otherwise modified
  • Where applicable, if and only if the model was modified, briefly describe how this was done.
(v-b) Auto-graded: details of auto-grading process and rubric used
  • Outline the process by which the autograder produces scores (eg, does it reward similarity with "gold standard" answer examples)
  • Share a brief description of any grading rubrics
  • Share a brief description of grading instructions
  • Provide one grading prompt example (redacted if needed)
  • State whether the auto-grader generated multiple samples per item to produce final scores
  • If multiple samples were generated, state the rule used to aggregate them into a single score
(v-c) Auto-graded: validation of auto-grader
  • State whether autograder was validated against human experts, another autograder, or not at all.
  • If and only if validated against humans, describe the expert sample (number and qualifications)
  • If validation was done, provide an agreement statistic
  • If validation was done, state whether validation covered full set or subset
  • Where applicable, if and only if validation was not done, provide a justification for this.
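
For the agreement statistics requested in (iv-c) and (v-c), Cohen's kappa is one common choice, whether comparing two human graders or an auto-grader against a human expert. The sketch below computes it from scratch; the grades shown are invented for illustration.

    from collections import Counter

    def cohens_kappa(grader_a, grader_b):
        """Inter-rater agreement between two graders scoring the same items."""
        assert len(grader_a) == len(grader_b)
        n = len(grader_a)
        # Observed agreement: fraction of items where both graders gave the same label.
        p_o = sum(a == b for a, b in zip(grader_a, grader_b)) / n
        # Expected agreement by chance, from each grader's label frequencies.
        freq_a = Counter(grader_a)
        freq_b = Counter(grader_b)
        p_e = sum((freq_a[lab] / n) * (freq_b[lab] / n) for lab in set(grader_a) | set(grader_b))
        return (p_o - p_e) / (1 - p_e)

    # Example: two graders scoring ten answers as "pass" or "fail" (hypothetical data).
    a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
    b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
    print(f"Cohen's kappa: {cohens_kappa(a, b):.2f}")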

Model Elicitation

Criterion | At a minimum (0.5 pt) | Full credit (1 pt)
(i) Model version(s) tested and relationship to launch model
  • Label every model version evaluated†
  • Specify whether any tested models are identical to the deployed model†
  • Say if safeguards were on or off
  • If only earlier or alternative snapshots were tested, provide some estimate of the capability gap (through a brief qualitative description, a comparison of model performances on a public benchmark, or another suitable technique).†
(ii) Safeguards & bypassing techniques
  • If evaluations involve models with safeguards applied, list the active mitigations during testing**
  • If evaluations involve models with safeguards applied, state whether jailbreak or bypass attempts were made**
  • If evaluations involve models with safeguards applied, describe the rigor of bypass attempts (eg, in time spent) in relation to the threat model considered OR justify that no bypassing efforts were necessary.**
(iii) Elicitation methods used
  • List scaffolding or tool integrations (requirement waived if scaffolds or tool integrations are clearly unnecessary, eg for multiple-choice or short-form answers)
  • List sampling / generation strategies or lack thereof
  • List resource ceilings (tokens, time) or lack thereof
  • List any fine-tuning datasets applied, or note the lack thereof (if the model card makes clear that the final launch model was tested, this implies that no fine-tuning was applied)
  • List prompting techniques employed (or lack thereof)
  • Give details sufficient for third-party replication OR confirm that details have been shared and reviewed by an independent organisation. For example, if fine-tuning is used, include a description of the dataset; if particular prompting techniques were used, include example prompts or describe the design process (an illustrative way of recording these details is sketched after the notes below).

** The approach to testing safeguarded models (including eg the active safeguards and the bypassing strategies) can be reported just once in the model card, so long as the reader can clearly distinguish which evaluations this applies to (if this is not made clear, the point is not awarded).

† These criteria can be met if the model card makes clear elsewhere that, unless otherwise specified, the models tested for each evaluation were of a particular type, and specifies the relation of that type to the deployed model. Be sure to search the entire model card comprehensively to see whether such statements exist and whether they apply to the particular eval you are scoring.
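
To illustrate the level of detail that the Model Elicitation criteria ask reports to disclose, the sketch below records one hypothetical evaluation's elicitation settings in a single structure. Every field name and value is illustrative only; STREAM does not prescribe a schema.

    # Hypothetical record of the elicitation details requested in 5.3(i)-(iii);
    # field names and values are invented for illustration, not prescribed by STREAM.
    elicitation_report = {
        "model_version": "model-x-2025-05-01",          # snapshot actually tested
        "identical_to_launch_model": False,
        "safeguards_active": False,
        "scaffolding": ["web_browsing", "code_interpreter"],
        "sampling": {"temperature": 1.0, "samples_per_item": 5, "aggregation": "best_of"},
        "resource_ceilings": {"max_output_tokens": 8192, "max_wall_clock_minutes": 30},
        "fine_tuning": None,                             # no fine-tuning applied
        "prompting": "zero-shot with a fixed system prompt (example in appendix)",
    }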

Results Reporting

Criterion | At a minimum (0.5 pt) | Full credit (1 pt)
(i) Main scores
  • Report the most relevant summary statistic (eg, mean or max score)
  • Report those numbers either in the main text, in a table, or in a figure with clear text labelling.
(ii) Uncertainty & number of benchmark runs
  • Provide an uncertainty metric for each key statistic (eg, confidence intervals or standard error; if a confidence interval is reported, the confidence level, eg 95%, must also be stated). This can be provided in-text or in a figure.
  • State the number of full-benchmark runs used per evaluation.***
(iii) Ablations / alt. conditions
  • State whether supplementary runs with major variations on the baseline evaluation conditions (eg different elicitation, resource ceilings, or test versions) were run.
  • Where applicable, if and only if such major variations were run, provide a clear breakdown of the results for each major testing condition.
  • Provide all stats in text, in a table, or in a graph with clear text labelling.

*** This can also be met if the model card makes clear elsewhere that each evaluation is run a given number of times or, alternatively, specifies a different method for establishing confidence intervals (such as a “bootstrap procedure”; a minimal sketch follows below).
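
The note above mentions a bootstrap procedure as one way of establishing confidence intervals. A minimal percentile-bootstrap sketch, assuming per-item 0/1 scores from a single benchmark run (the scores shown are invented), might look like the following.

    import random
    import statistics

    def bootstrap_ci(item_scores, n_resamples=10_000, confidence=0.95, seed=0):
        """Percentile bootstrap confidence interval for a benchmark's mean score."""
        rng = random.Random(seed)
        n = len(item_scores)
        means = sorted(
            statistics.mean(rng.choices(item_scores, k=n)) for _ in range(n_resamples)
        )
        lo_idx = int(((1 - confidence) / 2) * n_resamples)
        hi_idx = int((1 - (1 - confidence) / 2) * n_resamples) - 1
        return means[lo_idx], means[hi_idx]

    # Example: hypothetical per-item pass/fail scores from a single benchmark run.
    scores = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
    low, high = bootstrap_ci(scores)
    print(f"Mean {statistics.mean(scores):.2f}, 95% CI [{low:.2f}, {high:.2f}]")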

Baseline Results

Criterion | At a minimum (0.5 pt) | Full credit (1 pt)
(i-a) Human baseline: sample details
  • State sample size of baseline
  • State key qualifications of sample (including not just "expert", but specific domain of expertise, education level and/or relevant experience).
  • Briefly describe how the sample was recruited
  • Note sampling biases
(i-b) Human baseline: full scores and uncertainty metrics
  • Give human performance statistic(s) similar to those provided in 5.4(i).
  • Provide an uncertainty metric equivalent to that provided in 5.4(ii) (a minimal example of such reporting follows the Baseline Results criteria).
  • Where applicable, note any task differences vs. the AI test
  • Explicitly report whether baseline samples were tested on the full test or a subset of it.
(i-c) Human baseline: details of elicitation, resources available and incentives provided
  • State time allowed for the task
  • List tools or resources humans could use (eg, internet access, biological design tools, etc).
  • Describe incentives offered
  • Report actual time spent on typical questions
  • Note any issues with the testing environment or participant performance (such as compliance issues), or the absence thereof
(ii-a) No human baseline: justification of absence
  • If no human baseline is provided, give a brief justification for omitting humans (eg, too expensive, uninformative, or already exceeded by previous models)
  • State whether any consultations were held or any evidence was considered in support of that decision
(ii-b) No human baseline: alternative to human baseline
  • Provide another comparison point (e.g. earlier model checkpoints, scores from other publicly released models, or surveys of SMEs)
  • Explain why that baseline is meaningful
  • Discuss major uncertainties with the alternative baseline
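
As a minimal illustration of criterion (i-b) above, the sketch below reports a hypothetical human baseline as a mean with an approximate 95% confidence interval alongside a model score. All numbers are invented for illustration.

    import statistics

    def mean_and_sem(scores):
        """Mean and standard error of the mean for per-participant scores."""
        m = statistics.mean(scores)
        sem = statistics.stdev(scores) / len(scores) ** 0.5
        return m, sem

    # Hypothetical per-expert accuracies from a small human baseline (n = 8).
    human_scores = [0.55, 0.61, 0.48, 0.70, 0.58, 0.63, 0.52, 0.66]
    model_score = 0.71

    h_mean, h_sem = mean_and_sem(human_scores)
    print(f"Human baseline: {h_mean:.2f} +/- {1.96 * h_sem:.2f} (approx. 95% CI)")
    print(f"Model score:    {model_score:.2f}")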

Safety Interpretation

Criterion | At a minimum (0.5 pt) | Full credit (1 pt)
(i) Interpretation of test outcomes, capability conclusion and relation to decision-making
  • State the overall capability conclusion supported by the evaluation results****
  • Briefly describe how the conclusion affects developer decisions (eg whether the model crossed any capability thresholds in their Safety Framework or equivalent document)
  • Qualitatively or quantitatively explain the degree to which specific evaluations contributed to this conclusion (note that this is often done by referring to the capabilities specific evaluations measured, rather than by referring to evaluations by name; this is acceptable).
  • Disclose other important evidence streams, such as those performed by external parties or holistic red-teaming.
(ii) Details of test results that would have overturned the above capability conclusion and details of preregistration
  • State the configurations of evaluation results that would have overturned the overall capability conclusion reached
  • State whether these criteria (for "falsification" of the overall capability conclusion) were pre-registered with a credible third party.
(iii) Predictions of future performance gains from post-training and elicitation
  • Provide some quantitative or qualitative statement about how model capabilities might improve in the near term, including the implications of this improvement for risk levels, accounting for both post-training improvements and near-future model releases.
  • Directly tie those projected improvements to decision points (such as capability thresholds) of the company's Frontier AI Safety documentation and predict a crude time frame for reaching them
(iv) Time available to relevant teams for interpretation of results
  • Provide some statement about how long relevant teams had to form and communicate interpretations of results before deployment
  • Provide a rough quantified estimate of this time
(v) Presence of internal disagreements
  • Explicitly state whether any team members involved with testing disagreed with the above capability conclusion, or agreed with important caveats (if there was no disagreement, this must also be reported).
  • If such disagreements existed, summarise the nature of those disagreements
  • Explain how disagreements were handled OR explain how they would have been handled, had they occurred.

†† Criteria in this section can be reported once per report, rather than once per evaluation.

**** Note that one way you can meet this criterion is to have clearly marked every single CBRN evaluation as a “rule-in” or “rule-out” evaluation with a clear and comprehensible performance threshold; if this is done, then it is clear that falsifying results would have been inferred from rule-in or rule-out evaluations falling on the other side of that performance threshold, though in that case the model card should ideally explain what the developer would do in the rare case that a rule-in and a rule-out threshold explicitly contradicted one another.
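
One hypothetical way to make the “configuration of evidence” discussed in 5.6(ii) explicit is to aggregate per-evaluation interpretations into an overall conclusion, with the contradictory case called out separately, as in the sketch below. The outcome strings and the decision logic are illustrative only, not a procedure STREAM requires.

    def overall_conclusion(outcomes):
        """Aggregate per-eval interpretations, e.g. "capability ruled in",
        "capability ruled out", "cannot rule out capability", into one conclusion."""
        if "capability ruled in" in outcomes and "capability ruled out" in outcomes:
            return "contradiction: escalate per the developer's stated procedure"
        if "capability ruled in" in outcomes:
            return "capability threshold reached"
        if outcomes and all(o == "capability ruled out" for o in outcomes):
            return "capability threshold not reached"
        return "inconclusive: additional evidence required"

    print(overall_conclusion(["capability ruled out", "cannot rule out capability"]))
    # -> "inconclusive: additional evidence required"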

Endorsements

Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.

— UK AISI
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.

— FutureHouse
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.

— US CAISI
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.

— Deloitte
Greater transparency into AI risks is essential for responsible innovation. This requires that model evaluations are reported clearly, providing the public with enough information to make judgments. We're glad that the STREAM standard exists and hope to see future model cards adopt it.

— RAND

Authors

Tegan McCaslin (Independent)
Jide Alaga (METR)
Samira Nedungadi (SecureBio)
Seth Donoughe (SecureBio)
Tom Reed (UK AISI)
Chris Painter (METR)
Rishi Bommasani (HAI)
Luca Righetti (GovAI)

For email correspondence, contact Luca Righetti (luca.righetti@governance.ai)
