Measuring impact
Evaluating the ability of LLMs to quantify consequences
As AI systems grow more powerful and autonomous, their ability to weigh real-world consequences becomes crucial for achieving alignment with human goals. In this tech report, we publish the results of our evaluations of the ability of leading large language models (LLMs) to consistently quantify and compare a wide range of impacts, from greenhouse gas emissions and resource use to meaning and joy. The results reveal not only how far LLMs have come in grasping complex cause-and-effect relationships, but also where they continue to stumble, often due to inherent biases in their training data.

Juho Ojala
CTO & Co-Founder, Upright
Published Apr 8, 2025
AI alignment—ensuring that AI systems act in harmony with human values and goals—is widely regarded as the foremost challenge of our era.
Current AI systems take a deontological approach to alignment, judging actions by rules, duties, and principles rather than by consequences. This approach, however, is expected to fall short as AI systems grow more autonomous and surpass human intelligence in critical domains.
Replacing the deontological approach with a consequentialist approach will require AI systems to be able to accurately evaluate and compare the scope of potential outcomes of any action.
Since 2017, Upright has been working on consequentialist alignment of the private sector. To that end, Upright produces data on the impact of companies at scale, comprehensively considering all relevant positive and negative impacts that a company may have. Upright's public database provides data on more than 30,000 companies and draws upon hundreds of millions of scientific articles, statistical databases, and company-reported information.
Starting in 2018 with Google's BERT, the first publicly available large language model, Upright has been using large language models to process this vast amount of information. Doing this without naively trusting LLMs requires effective automated benchmarks that evaluate the LLMs' relevant capabilities. One such evaluation tests the ability of large language models to assess and compare the magnitude of different impacts. Recognizing that this capability is key not only for our own work but also for achieving alignment of future AI systems, we have chosen to release the results of this evaluation for the public good.
Imagine being asked to gauge whether the greenhouse gas emissions from steel production are more or less significant than its contribution to societal infrastructure, or whether the delight we derive from chocolate outweighs its negative health impacts.
Now, extend this thought experiment to thousands of such comparisons across hundreds of different impacts—and you are required to answer in a way that remains logically consistent. For example, if you conclude that impact A is ten times greater than B, and that C is five times greater than A, you must also conclude that C is fifty times greater than B—all without remembering your previous answers.
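The consistency requirement in this example can be written as a small check in log space. The following is a minimal sketch for illustration only: the function name `triple_log_error` and the sample ratios are hypothetical and are not part of the evaluation itself.

```python
import math

def triple_log_error(r_ab, r_ca, r_cb):
    """Return |log(r_cb) - (log(r_ca) + log(r_ab))|.

    r_ab: stated ratio "A is r_ab times B"
    r_ca: stated ratio "C is r_ca times A"
    r_cb: stated ratio "C is r_cb times B"
    A value of (approximately) zero means the three answers are consistent.
    """
    return abs(math.log(r_cb) - (math.log(r_ca) + math.log(r_ab)))

# The example from the text: A = 10 x B, C = 5 x A, so C should be 50 x B.
consistent = triple_log_error(10.0, 5.0, 50.0)

# A hypothetical inconsistent answer: C is claimed to be only 20 x B.
inconsistent = triple_log_error(10.0, 5.0, 20.0)
```

Here `consistent` is zero up to floating-point rounding, while `inconsistent` equals log(2.5), the factor by which the third answer disagrees with the first two.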

Performing well in this task requires (1) sophisticated and consistent counterfactual reasoning, (2) balanced and nuanced causal attribution, (3) a coherent value system, and (4) robust handling of uncertainty and vagueness.
It also requires accepting that such comparisons need to be made. This is not a problem for LLMs, which are trained to have a can-do attitude, but humans tend to struggle with it. We have written on this topic separately in our FAQ (see the section about comparing apples to oranges).
Within our CoCoCo (Consequence Magnitude Comparison Consistency) evaluation, large language models must perform thousands of such comparisons, generating an aggregate set of implied impact sizes for each impact topic.
For example, the table below lists the implied impact sizes for social media platforms, produced from thousands of pairwise comparisons done by Claude 3.7 Sonnet:
| Upright impact category | Upright impact topic | Implied impact size |
|---|---|---|
| Relationships | facilitates communication | 100.0 |
| Relationships | connects individuals around shared interests | 81.9 |
| Relationships | enables shared activities | 67.6 |
| Distributing knowledge | distributes knowledge | 60.0 |
| Knowledge infrastructure | enables distribution of knowledge | 52.9 |
| Distributing knowledge | distributes information without transparency on sources | 51.7 |
| Equality & human rights | exploits personal data | 49.9 |
| Societal stability | enables propaganda | 48.5 |
| Equality & human rights | enables surveillance | 46.2 |
| Knowledge infrastructure | provides tools for knowledge sharing | 46.0 |
We then use the implied impact sizes to measure how well the individual pairwise comparisons align with these aggregate sizes as a gauge of internal consistency of an LLM's understanding of consequences.
For ease of interpretation, we calculate a metric that reflects what share of the comparison results were close to the result expected from the implied impact sizes. This yields a performance metric for the internal consistency of the responses, expressed as a percentage.
(Technically, a comparison counts as "close" when the log-transformed ratio given by the pairwise comparison and the log-transformed ratio implied by the impact sizes differ by less than 0.5 standard deviations.)
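To make the procedure concrete, here is a minimal sketch of how implied sizes and the consistency share could be computed from a list of pairwise ratios. It is not Upright's implementation. Two things are assumptions of this sketch: the sizes are fitted by least squares in log space, and the "standard deviation" in the closeness rule is taken to be the standard deviation of the observed log-ratios, since the text does not say which one is meant. The names `implied_sizes` and `consistency_share` are hypothetical.

```python
import math
from statistics import pstdev

def implied_sizes(comparisons, iterations=200):
    """Fit one size per topic so that size[a] / size[b] best matches the stated ratios.

    comparisons: list of (topic_a, topic_b, ratio), meaning "a is `ratio` times b".
    Assumption: least-squares fit in log space, solved by repeatedly setting each
    topic's log-size to the average value suggested by its own comparisons.
    """
    topics = sorted({t for a, b, _ in comparisons for t in (a, b)})
    log_size = {t: 0.0 for t in topics}
    for _ in range(iterations):
        for t in topics:
            suggestions = [log_size[b] + math.log(r) for a, b, r in comparisons if a == t]
            suggestions += [log_size[a] - math.log(r) for a, b, r in comparisons if b == t]
            log_size[t] = sum(suggestions) / len(suggestions)
    top = max(log_size.values())
    # Scale so that the largest impact gets size 100, as in the tables.
    return {t: 100.0 * math.exp(v - top) for t, v in log_size.items()}

def consistency_share(comparisons, sizes, threshold_sd=0.5):
    """Share of comparisons whose log-ratio is close to the implied log-ratio.

    Assumption: "standard deviation" is the population standard deviation
    of the observed log-ratios.
    """
    observed = [math.log(r) for _, _, r in comparisons]
    expected = [math.log(sizes[a] / sizes[b]) for a, b, _ in comparisons]
    sd = pstdev(observed)
    close = [abs(o - e) < threshold_sd * sd for o, e in zip(observed, expected)]
    return sum(close) / len(close)

# Hypothetical answers that agree with each other perfectly.
consistent = [("A", "B", 10.0), ("C", "A", 5.0), ("C", "B", 50.0)]
sizes = implied_sizes(consistent)
share = consistency_share(consistent, sizes)

# Hypothetical answers that contradict each other (C vs. B should be 50, not 5).
noisy = [("A", "B", 10.0), ("C", "A", 5.0), ("C", "B", 5.0)]
noisy_share = consistency_share(noisy, implied_sizes(noisy))
```

With the consistent answers the fitted sizes come out at roughly 100, 20, and 2 for C, A, and B, and every comparison counts as close. With the contradictory answers the fit has to spread the disagreement across all three comparisons, so none of them ends up close to the implied ratio.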
Of the tested models, Anthropic's Claude 3.7 Sonnet in thinking mode performed best, with OpenAI's o1 coming in second, and the Chinese DeepSeek R1 model coming in third.
We ran two evaluations for Claude 3.7 Sonnet, one with thinking disabled and one with a thinking budget of 2,000 tokens, allowing the model to think before answering. Claude 3.7 Sonnet performed slightly better in thinking mode. The other thinking models were run with the default settings, meaning a medium reasoning effort for OpenAI's o3-mini and o1 models.
Models selected include frontier models from leading providers, along with all frontier models from OpenAI since GPT3.5 (turbo) published in early 2022. OpenAI's GPT4.5, which is in pre-release status, was not included for cost reasons. The total number of pairwise comparisons by model was 3,200.
In terms of price-performance ratio, there is no clear winner, though newer models perform better overall. Note that the scale on the horizontal axis is logarithmic.
Plotting the performance as a function of the (original) release date of each model shows how performance has improved since the original release of ChatGPT (GPT 3.5).
(Strictly speaking, OpenAI initially released ChatGPT based on GPT3.5 instead of GPT3.5-turbo, but the two are very similar models in terms of performance. The original GPT3.5 model is no longer available.)
Despite the progress, the eval remains challenging even for current frontier models. In an easier version of the eval, where we require only the ordering of the impacts to be consistent, the leading model achieves almost 90% consistency. This is not an easy feat, considering that we are comparing very diverse impacts ranging from GHG emissions to creating meaning & joy.
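The ordering-only variant can be illustrated with a short sketch. The implied sizes below are taken from the social media table earlier in this report. The three pairwise answers and the function name `ordering_consistency` are hypothetical, chosen only to show how a comparison can get the direction right while missing the magnitude.

```python
def ordering_consistency(comparisons, sizes):
    """Share of comparisons whose direction matches the ordering of the implied sizes.

    comparisons: list of (topic_a, topic_b, ratio), meaning "a is `ratio` times b".
    Only the direction (ratio above or below 1) is checked, not the magnitude.
    """
    agree = 0
    for a, b, ratio in comparisons:
        stated_a_larger = ratio > 1.0
        implied_a_larger = sizes[a] > sizes[b]
        agree += stated_a_larger == implied_a_larger
    return agree / len(comparisons)

# Implied sizes from the social media table (Claude 3.7 Sonnet).
sizes = {
    "facilitates communication": 100.0,
    "enables propaganda": 48.5,
    "enables surveillance": 46.2,
}

# Hypothetical individual answers, for illustration only.
comparisons = [
    ("facilitates communication", "enables propaganda", 2.0),    # direction matches
    ("enables surveillance", "enables propaganda", 1.1),         # direction contradicts sizes
    ("facilitates communication", "enables surveillance", 3.0),  # direction matches, magnitude is off
]

share = ordering_consistency(comparisons, sizes)
```

In this example two of the three answers agree with the implied ordering. The third answer overstates the ratio (3.0 against an implied ratio of about 2.2), which would count against it under the stricter magnitude-based metric but not here.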
The full results are outlined in the table below:
| Model | Model type | Performance | $/1M tasks | Mean log err. | Release date |
|---|---|---|---|---|---|
| Claude 3.7 Sonnet (t) | Hybrid | 73.25 % | $19,038 | 0.37 | 2025-02-24 |
| Claude 3.7 Sonnet | Hybrid | 72.47 % | $9,116 | 0.38 | 2025-02-24 |
| OpenAI o1 | Reasoning | 72.28 % | $96,391 | 0.40 | 2024-09-12 |
| Deepseek R1-1776 | Reasoning | 69.59 % | $4,237 | 0.44 | 2025-01-20 |
| OpenAI o3-mini | Reasoning | 66.88 % | $8,338 | 0.46 | 2025-01-31 |
| OpenAI GPT-4 | Classic | 66.22 % | $25,487 | 0.46 | 2023-03-14 |
| OpenAI GPT-4o | Classic | 66.03 % | $4,072 | 0.45 | 2024-05-13 |
| Gemini 1.5 Pro | Classic | 64.72 % | $2,204 | 0.45 | 2025-02-15 |
| Gemini 2.0 Flash | Classic | 61.03 % | $159 | 0.49 | 2025-01-30 |
| Mistral Large 2 | Classic | 59.84 % | $2,251 | 0.49 | 2024-07-24 |
| OpenAI GPT 3.5 Turbo | Classic | 30.47 % | $418 | 0.83 | 2022-03-01 |
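One way to read the "no clear winner" remark on price-performance is to look at which models are not dominated, meaning that no other model is both cheaper and more consistent. The sketch below uses the cost and performance columns of the table above. The Pareto-frontier framing and the function name `pareto_frontier` are our illustration and are not part of the original analysis.

```python
# (model, cost in $ per 1M tasks, performance in %), copied from the table above.
models = [
    ("Claude 3.7 Sonnet (t)", 19038, 73.25),
    ("Claude 3.7 Sonnet", 9116, 72.47),
    ("OpenAI o1", 96391, 72.28),
    ("Deepseek R1-1776", 4237, 69.59),
    ("OpenAI o3-mini", 8338, 66.88),
    ("OpenAI GPT-4", 25487, 66.22),
    ("OpenAI GPT-4o", 4072, 66.03),
    ("Gemini 1.5 Pro", 2204, 64.72),
    ("Gemini 2.0 Flash", 159, 61.03),
    ("Mistral Large 2", 2251, 59.84),
    ("OpenAI GPT 3.5 Turbo", 418, 30.47),
]

def pareto_frontier(rows):
    """Return the models that no cheaper (or equally cheap) model outperforms.

    Rows are scanned from cheapest to most expensive; a model is kept only if
    it beats the best performance seen so far.
    """
    frontier = []
    best = float("-inf")
    for name, cost, perf in sorted(rows, key=lambda r: (r[1], -r[2])):
        if perf > best:
            frontier.append(name)
            best = perf
    return frontier

frontier = pareto_frontier(models)
```

On these figures, six of the eleven models sit on the frontier, from Gemini 2.0 Flash at the cheap end to Claude 3.7 Sonnet in thinking mode at the expensive end, so the choice depends on how much extra consistency is worth paying for.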
While an LLM's internal consistency in understanding the world is one matter, its alignment with reality—including human values—is quite another. We are currently preparing additional evaluations in this area by drawing on our extensive database of 300 million research papers (converted into structured data), along with a curated set of 30,000 manually annotated papers.
In the meantime, we can highlight three key findings:
- Internal consistency of answers (unsurprisingly) correlates with factual accuracy
- LLMs naturally overemphasize impacts at the production stage while underestimating impacts during end-use
- LLMs often favor counterfactuals that cast companies' impacts in a more positive light and that minimize environmental impacts in particular
We suspect these biases arise from the LLMs' training data, which include large volumes of corporate sustainability reports and similar content. Such documents typically adopt a positive spin and focus on readily measurable internal operations, with the latter leading to an overemphasis on production-related matters.
Consider, for instance, the impact of electric vehicles. The table below lists the sizes of the top impacts of electric vehicles as implied by answers provided by Claude 3.7 Sonnet to thousands of pairwise comparison tasks:
| Upright impact category | Upright impact topic | Implied impact size |
|---|---|---|
| GHG emissions | reduces greenhouse gas emissions | 100.0 |
| Biodiversity | preserves biodiversity by mitigating climate change | 52.8 |
| Non-GHG emissions | prevents pollution | 50.4 |
| Relationships | improves mobility for in-person interaction | 35.3 |
| Scarce natural resources | consumes freshwater | 34.6 |
| GHG emissions | creates carbon dioxide emissions | 34.6 |
| Knowledge infrastructure | accelerates innovation | 32.8 |
| Biodiversity | preserves biodiversity by reducing pollution | 26.7 |
| Societal infrastructure | contributes to societal transport infrastructure | 26.2 |
| Scarce natural resources | consumes non-renewable scarce natural resources | 19.5 |
Aggregated into an Upright net impact profile, which summarizes both the positive and negative impacts in a set of 19 standardized, mutually exclusive and collectively exhaustive (MECE) impact categories, the results look as follows:
Contrast this with results from the Upright net impact model for the same product:
The Upright net impact model estimates the product's impacts based on scientific research, using LLMs only to convert unstructured data into structured form rather than treating them as a source of truth. While the Upright net impact model should not be regarded as ground truth, it provides a relevant point of comparison for Claude's results.
Both the Upright model and Claude acknowledge the product's positive contributions to societal infrastructure and the reduction of greenhouse gas (GHG) emissions. Their main difference lies in Claude's assessment of the product's negative environmental impacts, which it deems much smaller than the Upright model does.
According to the Upright net impact model, electric vehicles both create and reduce GHG emissions, and are not unambiguously positive or negative for the environment. Instead, their contributions to societal infrastructure (e.g. through electric buses), employment, and tax revenue tip the balance into net positive territory. In contrast, Claude sees the product as having a clearly beneficial environmental impact. The differences largely come down to emphasizing different counterfactuals: Claude emphasizes that electric vehicles substitute for gasoline-powered vehicles, while the Upright model weighs more strongly the fact that electric vehicles still create GHG emissions in the process.
The same tendency towards de-emphasizing negative environmental impacts shows in Claude's understanding of the impact of large language models, with greenhouse gas emissions being only 43rd on the list of their top impacts:
| Upright impact category | Upright impact topic | Implied impact size |
|---|---|---|
| Distributing knowledge | distributes information | 100.0 |
| Societal stability | creates existential risk | 33.9 |
| Knowledge infrastructure | enables utilisation of knowledge | 32.1 |
| Meaning & joy | facilitates learning | 23.5 |
| Distributing knowledge | enhances searchability of information | 22.8 |
| Distributing knowledge | makes complex information understandable | 21.9 |
| Knowledge infrastructure | provides tools for knowledge sharing | 21.8 |
| Distributing knowledge | distributes information without transparency on sources | 20.3 |
| Knowledge infrastructure | enables management of knowledge | 19.2 |
| Knowledge infrastructure | accelerates innovation | 17.1 |
| Knowledge infrastructure | preserves knowledge | 16.6 |
| Knowledge infrastructure | facilitates research | 15.1 |
It is important to note that, unlike popular reports of LLMs giving surprising answers to individual prompts, these results reflect what the LLMs (in this case Claude) consistently "think" about the magnitudes of these impacts, because they are based on thousands of pairwise comparisons.
Overall, the results suggest that LLMs possess much of the underlying knowledge required to comprehensively assess the varied impacts of well-known products and services (and likely also other things, such as actions), yet still exhibit significant biases when prompted neutrally.
Growing concern about existential risks from AI has drawn the attention of experts across business, technology, and policy. Hundreds—including Bill Gates, Elon Musk, Sam Altman, and OpenAI co-founder Ilya Sutskever—have signed the Center for AI Safety's statement warning that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."
While experts articulate a range of reasons to worry about AI alignment, the fundamental reason seems to be that AI systems are (1) complex systems (2) with emergent behavior (3) that we are likely not able to control and (4) which have the potential to affect society at a very large scale.
Most of the worries relate to future AI systems, not current ones. However, what seems to be missing from the picture is that we are already "running" a complex system with emergent behavior, which no one really controls, and which is affecting society at a very large scale: the market economy.
Upright's mission, rephrased in AI lingo, is to solve the alignment problem of the market economy, i.e. to drive the economy towards acting in a way that is consistent with human values and goals.
The two big alignment problems, (1) aligning AI systems and (2) aligning the economic system, are largely interwoven. Rapid advances in AI are poised to exacerbate the economic system's alignment problem, while market incentives drive ever-faster AI development and adoption. The two problems seem to be increasingly becoming one and the same, such that neither can be solved without the other.
If your work touches on AI alignment and you'd like further information about our evaluations, please contact us at alignment@uprightproject.com. We're also happy to discuss our research with scholars, bloggers, and journalists. If you're interested in licensing our evaluations to support alignment or capabilities work, feel free to reach out at the same address.
We're continuously seeking exceptional individuals for engineering roles. If you're excited to tackle complex software and data challenges related to producing better impact data to advance alignment, visit our jobs page! We are also hiring full-stack developers to work on our various tools.
Upright has a U.S. patent pending on key techniques underlying this tech report.
To reference this tech report, you may use the citation information below:
Ojala, J. (2025). Evaluating the ability of LLMs to consistently quantify a wide range of consequences (Technical Report No. UPTR-202501). Upright Oy. https://www.uprightproject.com/blog/evaluating-llms


