AI risk tests expose hidden flaws in generative AI
Generative artificial intelligence can rank candidates, answer customer questions or support internal decision-making. That is useful, but it becomes risky when the same system behaves differently after small changes in wording, language or context. Companies must find failures and prove with evidence that risks have been tested and managed. The EIC-funded QuantPi project developed a platform for generative artificial intelligence risk management. Its PiCrystal technology automatically creates test suites, checks model behaviour and turns results into documentation and audit-ready evidence linked to rules such as the European Union Artificial Intelligence Act.
AI risk tests reveal unstable behaviour and bias
The first problem to surface is often not a dramatic failure but a basic inconsistency. As Lukas Bieringer, head of policy and grants at QuantPi, explains, “It is typically inconsistent or unstable behaviour under realistic variation of inputs – the same prompt class yields materially different outputs depending on phrasing, language or context.” That matters because a generative AI tool may appear reliable in a simple benchmark yet fail when real users phrase requests differently. Bias and fairness gaps can then emerge, especially when performance appears acceptable on average but breaks down within specific groups. One proof-of-value case assessed a large language model-based candidate recommender system on the recruiting platform Stepstone and through TÜV AI.Lab, an AI assurance laboratory. The lesson: employment-related AI testing needs datasets large and representative enough to support intersectional testing, where overlapping characteristics can be checked rather than hidden inside broad averages.
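How an average can hide an intersectional gap is easiest to see with numbers. The following is a minimal sketch on synthetic data, not QuantPi's method: the attribute names `attr_a` and `attr_b`, the record counts, the threshold and the minimum sample size are all assumptions chosen for illustration.

```python
from collections import defaultdict

THRESHOLD = 0.75    # hypothetical acceptance threshold for the pass rate
MIN_SAMPLES = 10    # hypothetical minimum group size for a usable estimate

def pass_rates(records, keys):
    """Group records by the given attribute keys; return {group: (pass_rate, n)}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passes, total]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        counts[group][0] += int(rec["passed"])
        counts[group][1] += 1
    return {g: (p / n, n) for g, (p, n) in counts.items()}

def make(attr_a, attr_b, n, n_pass):
    """Build n synthetic test records, of which the first n_pass passed."""
    return [{"attr_a": attr_a, "attr_b": attr_b, "passed": i < n_pass}
            for i in range(n)]

# Synthetic results: three groups do well, one intersection does poorly.
records = (make("x", "p", 40, 38) + make("x", "q", 20, 19)
           + make("y", "p", 20, 19) + make("y", "q", 10, 4))

overall = pass_rates(records, [])                    # one group: everything
by_a = pass_rates(records, ["attr_a"])               # single attribute
by_b = pass_rates(records, ["attr_b"])               # single attribute
by_both = pass_rates(records, ["attr_a", "attr_b"])  # intersections

# Only intersections with enough samples can be judged at all.
flagged = [g for g, (rate, n) in by_both.items()
           if n >= MIN_SAMPLES and rate < THRESHOLD]

print("overall:", overall[()])
print("flagged intersections:", flagged)
```

In this synthetic data the overall pass rate and every single-attribute pass rate stay above the threshold, while one intersection falls well below it. The `MIN_SAMPLES` check reflects the dataset-size point: an intersection with too few records cannot be assessed either way.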
One evidence base for technical, legal and board users
QuantPi’s approach separates evidence from presentation. A data scientist may need detailed test results by metric, subgroup and scenario. A legal expert needs links to regulatory clauses and standards. A governance lead needs a portfolio view across systems. A board-level decision-maker needs a few residual-risk indicators without false precision. Bieringer summarises the design principle: “The key design principle: all views must derive from the same statistical evidence base, so that a board-level statement can always be traced to a specific test result.” This traceability is important because risk decisions involve multiple teams, each of which needs a view that aligns with its responsibilities.
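The idea of several views over one evidence base can be sketched as a small data structure. This is an illustrative assumption about how such traceability might be modelled, not a description of the QuantPi platform; `EvidenceRecord`, `technical_view`, `board_view` and all field names and values are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceRecord:
    """One test result in the shared evidence base (hypothetical schema)."""
    result_id: str
    metric: str
    subgroup: str
    value: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.value >= self.threshold

def technical_view(records):
    """Detailed view: every result, by metric and subgroup."""
    return [(r.result_id, r.metric, r.subgroup, r.value, r.passed)
            for r in records]

def board_view(records):
    """Summary view: one status plus the IDs of the results behind it."""
    failing = [r.result_id for r in records if not r.passed]
    return {
        "status": "needs_attention" if failing else "ok",
        "failing_result_ids": failing,  # trace back to specific results
    }

evidence = [
    EvidenceRecord("r1", "accuracy", "all", 0.91, 0.80),
    EvidenceRecord("r2", "accuracy", "group_1", 0.62, 0.80),
]

print(technical_view(evidence))
print(board_view(evidence))
```

Because both functions read the same list of records, the summary status can always be followed back to the individual results that produced it.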
Continuous monitoring keeps AI evidence current
Automated testing does not remove human judgment. People still define the system’s intended use, choose which fairness or safety definitions apply, set acceptance thresholds and decide whether to deploy, delay or withdraw a system. Automation measures risk; accountable people decide what level of risk is acceptable. Monitoring is triggered by any change that can undermine the existing risk evidence. A model update, revised prompt, new retrieval index, modified tool set or change in upstream data can invalidate earlier tests. Input drift, output drift and changing rules or internal policies may also require reassessment. For many companies, the biggest remaining obstacle is not awareness of the rules. Bieringer is forthright about it: “What they cannot do is produce technical evidence of conformity that holds up in front of a notified body or auditor – particularly for generative systems, where traditional benchmark reports are insufficient and harmonised standards are missing to date.” QuantPi’s platform addresses that gap by converting technical tests into evidence that different teams can use before and after deployment.
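One simple way to detect the configuration changes listed above is to fingerprint each component at test time and compare the fingerprints later. The sketch below assumes a hypothetical configuration dictionary with five components; it is not the mechanism used by the QuantPi platform, and it covers only explicit changes, not input or output drift, which need statistical checks on live data.

```python
import hashlib
import json

# Hypothetical component names, mirroring the change triggers in the text.
COMPONENTS = ("model", "prompt", "retrieval_index", "tools", "upstream_data")

def fingerprint(value):
    """Stable SHA-256 hash of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def snapshot(config):
    """Record one fingerprint per component at the time tests were run."""
    return {name: fingerprint(config[name]) for name in COMPONENTS}

def changed_components(tested_snapshot, current_config):
    """Return the components whose fingerprint differs from the tested state."""
    current = snapshot(current_config)
    return [name for name in COMPONENTS
            if tested_snapshot[name] != current[name]]

# Illustrative values only.
tested_config = {
    "model": "model-v1",
    "prompt": "Rank the candidates for this role.",
    "retrieval_index": "index-2024-01",
    "tools": ["search"],
    "upstream_data": "dataset-v3",
}
tested = snapshot(tested_config)

# Later, only the prompt has been revised.
current_config = dict(tested_config, prompt="Rank and explain the candidates.")

stale = changed_components(tested, current_config)
print("reassessment needed for:", stale)
```

A non-empty list signals that the earlier test evidence may no longer apply; whether to re-run the tests, and which ones, remains a decision for the accountable people.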