KORA: Building an Open-Source Benchmark for AI Child Safety — Methodology, Findings, and the Road Ahead

On the 8th of June 2026, Generation AI hosted a webinar with Stéphie Herlin, Product & Research Lead at Korabench.ai. Dr Anthony Bridgen, project coordinator for Generation AI, reflects on the session.

Half of teens use AI companions regularly, with 30% doing so every day. New and more capable frontier models are being released every year and apps are increasingly leveraging them. Despite this, there is a notable lack of high-quality data from large-scale, real-world research on how these tools impact young people. Essentially, we are running a huge, uncontrolled experiment on young people, releasing tools which have not been built with their safety in mind, but which are being used by them en masse. As such, there is a real need for mechanisms to evaluate the performance of AI chatbots with regard to the way in which they expose young people to a huge array of risks, ranging from cognitive atrophy to sexual content.

Enter KORA, a non-profit benchmarking platform launched in February 2026. By generating synthetic interactions between frontier models and an LLM posing as a child, and judging these with three separate LLMs, they have been able to analyse 100,000 conversations to evaluate 35 models against a taxonomy of 26 risks. By having experts review a sample of these conversations, they have been able to validate whether simulations of child behaviour are reasonable and identifications of harm correct and reflective of established principles. Their results indicate that there do appear to be considerable differences in the risks posed by mainstream models, with open-source models such as Mistral performing particularly poorly. Whilst obvious harms such as sexual content tend to be guarded against fairly robustly, more subtle risks such as educational integrity and developmental risks are much more prevalent.
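To make the shape of such a pipeline concrete, here is a minimal sketch of how three judge verdicts on a synthetic conversation might be combined and turned into a per-model harm rate. It is not KORA's implementation: the webinar did not specify how the three judges are aggregated, so the majority vote below is an assumption, and the names `Judge`, `Result`, `majority_vote`, `evaluate` and `harm_rate`, along with the keyword-matching stand-in judges, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: a judge reads a transcript and returns True if it sees harm.
# In a real pipeline each judge would be a call to a separate LLM.
Judge = Callable[[List[str]], bool]


@dataclass
class Result:
    model: str     # the model under evaluation
    risk: str      # the taxonomy category the scenario targets
    harmful: bool  # aggregated verdict across judges


def majority_vote(verdicts: List[bool]) -> bool:
    """Assumed aggregation rule: harm is flagged if more than half the judges agree."""
    return sum(verdicts) > len(verdicts) / 2


def evaluate(model: str, risk: str, transcript: List[str], judges: List[Judge]) -> Result:
    """Collect one verdict per judge and combine them into a single result."""
    verdicts = [judge(transcript) for judge in judges]
    return Result(model=model, risk=risk, harmful=majority_vote(verdicts))


def harm_rate(results: List[Result], model: str) -> float:
    """Share of a model's evaluated conversations that were flagged as harmful."""
    relevant = [r for r in results if r.model == model]
    if not relevant:
        return 0.0
    return sum(r.harmful for r in relevant) / len(relevant)


if __name__ == "__main__":
    # Stand-in judges based on keyword matching, purely for illustration.
    judges: List[Judge] = [
        lambda t: any("here is the full essay" in turn.lower() for turn in t),
        lambda t: any("essay" in turn.lower() for turn in t),
        lambda t: False,
    ]
    transcript = [
        "child: can you just write my essay for me?",
        "model: Sure, here is the full essay.",
    ]
    result = evaluate("model-a", "educational integrity", transcript, judges)
    print(result)
    print(harm_rate([result], "model-a"))
```

Using an odd number of judges means a simple majority always produces a verdict, and the expert review of a sample described above is what would tell you whether those verdicts can be trusted.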

This approach, whilst a valuable starting point, does have limitations. Firstly, it is entirely synthetic, and as such it is difficult to verify the ecological validity of its outcomes. Secondly, whilst the risk taxonomy is well-developed, we have little understanding of the prevalence of these risks in real-world conversations between children and chatbots. Thirdly, due to the character of the ecosystem itself, the benchmark is inherently US-centric and as such may not reflect values and cultural norms elsewhere.

Key, then, to iterating on and improving benchmarks is data. Such data already exists thanks to young people naturally interacting with LLMs, but it is held by the companies developing these models. Access to this vast corpus of interaction traces would be invaluable for identifying the types and prevalence of risk, and which models perform best. Failing this, data donation by young people themselves could provide a route to grounding benchmarks in the real world. Until this is achieved, it is difficult to know how meaningful attempts to rank frontier models really are.

This data must span languages and cultures so that benchmarks apply beyond the frame of reference of English-language-dominated Western culture. It is well established that AI models reflect the data they are trained on, and benchmarking efforts must find a way to account for this.

As a whole new ecosystem of benchmarks emerges (ParentBench, L2-Bench and ImpactBench, to name a few), collaboration will be key to identifying common risks, shared approaches and best practice across the board.

If you are interested in hearing about future seminars, reach out to global.challenges@reuben.ox.ac.uk
