
Evaluating the ability of LLMs to quantify consequences

As AI systems grow more powerful and autonomous, their ability to weigh real-world consequences becomes crucial for achieving alignment with human goals. In this tech report, we publish results from our evaluations of how consistently leading large language models (LLMs) can quantify and compare a wide range of impacts, from greenhouse gas emissions and resource use to meaning and joy. The results reveal not only how far LLMs have come in grasping complex cause-and-effect relationships, but also where they continue to stumble, often due to inherent biases in their training data.

Juho Ojala

CTO & Co-Founder, Upright

Published Apr 8, 2025

Introduction

AI alignment—ensuring that AI systems act in harmony with human values and goals—is widely regarded as the foremost challenge of our era.

Current AI systems approach alignment deontologically: they judge actions based on rules, duties, and principles rather than consequences. This approach, however, is expected to fall short as AI systems grow more autonomous and surpass human intelligence in critical domains.

Replacing the deontological approach with a consequentialist approach will require AI systems to be able to accurately evaluate and compare the scope of potential outcomes of any action.

Since 2017, Upright has been working on consequentialist alignment of the private sector. To that end, Upright produces data on the impact of companies at scale, comprehensively considering all relevant positive and negative impacts that a company may have. Upright's public database provides data on more than 30,000 companies and draws upon hundreds of millions of scientific articles, statistical databases, and company-reported information.

Starting in 2018 with Google's BERT, the first publicly available large language model, Upright has been using large language models to process this vast amount of information. Doing this without naively trusting LLMs requires effective automated benchmarks that evaluate the LLMs' relevant capabilities. One such evaluation tests the ability of large language models to assess and compare the magnitude of different impacts. Recognizing that this capability is key not only for our own work but also for achieving alignment of future AI systems, we have chosen to release the results of this evaluation for the public good.

Evaluating the ability of LLMs to consistently quantify consequences

Imagine being asked to gauge whether the greenhouse gas emissions from steel production are more or less significant than its contribution to societal infrastructure, or whether the delight we derive from chocolate outweighs its negative health impacts.

Now, extend this thought experiment to thousands of such comparisons across hundreds of different impacts—and you are required to answer in a way that remains logically consistent. For example, if you conclude that impact A is ten times greater than B, and that C is five times greater than A, you must also conclude that C is fifty times greater than B—all without remembering your previous answers.
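The arithmetic behind this requirement can be made concrete with a small sketch. The function names below are illustrative and not part of the CoCoCo evaluation itself; the sketch only shows how chaining two ratios implies a third, and how a deviation from it can be measured in log space.

```python
import math


def chained_ratio(a_over_b: float, c_over_a: float) -> float:
    """Ratio C/B implied by chaining the answers for A/B and C/A."""
    return a_over_b * c_over_a


def log_gap(stated_c_over_b: float, a_over_b: float, c_over_a: float) -> float:
    """Absolute log-space gap between the stated C/B and the chained C/B.

    Zero means the three answers are perfectly consistent.
    """
    return abs(math.log(stated_c_over_b) - math.log(chained_ratio(a_over_b, c_over_a)))


# The example from the text: A = 10 x B and C = 5 x A imply C = 50 x B.
consistent_gap = log_gap(50.0, a_over_b=10.0, c_over_a=5.0)
# Answering C = 25 x B instead is off by a factor of two.
inconsistent_gap = log_gap(25.0, a_over_b=10.0, c_over_a=5.0)
```

Working in log space treats "twice too large" and "twice too small" as equally wrong, which suits comparisons that span several orders of magnitude.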

Performing well in this task requires (1) sophisticated and consistent counterfactual reasoning, (2) balanced and nuanced causal attribution, (3) a coherent value system, and (4) robust handling of uncertainty and vagueness.

It also requires accepting that such comparisons need to be made. This is not a problem for LLMs, which are trained to have a can-do attitude, but humans tend to struggle with it. We have written on this topic separately in our FAQ (see the section about comparing apples to oranges).

Within our CoCoCo (Consequence Magnitude Comparison Consistency) evaluation, large language models must perform thousands of such comparisons, generating an aggregate set of implied impact sizes for each impact topic.

For example, the table below lists the implied impact sizes for social media platforms, produced from thousands of pairwise comparisons done by Claude 3.7 Sonnet:

 
Claude "thinks" that the top positive impacts of social media relate to connecting people and distributing knowledge, while its top negative impacts relate to unreliable information and privacy

| Upright impact category | Upright impact topic | Implied impact size |
| --- | --- | --- |
| Relationships | facilitates communication | 100.0 |
| Relationships | connects individuals around shared interests | 81.9 |
| Relationships | enables shared activities | 67.6 |
| Distributing knowledge | distributes knowledge | 60.0 |
| Knowledge infrastructure | enables distribution of knowledge | 52.9 |
| Distributing knowledge | distributes information without transparency on sources | 51.7 |
| Equality & human rights | exploits personal data | 49.9 |
| Societal stability | enables propaganda | 48.5 |
| Equality & human rights | enables surveillance | 46.2 |
| Knowledge infrastructure | provides tools for knowledge sharing | 46.0 |

Table 1: Implied impact magnitudes for social media platforms, aggregated from pairwise comparisons run with Claude 3.7 Sonnet. Upright impact topics are standardized topics used within the Upright net impact framework to comprehensively cover all types of impacts. The numbers have been normalized such that the largest impact has size 100.
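The report does not specify how the pairwise answers are aggregated into implied sizes. As a minimal sketch of one plausible approach, the code below assumes a least-squares fit in log space, solved by simple coordinate updates, followed by the normalization to 100 described in the caption. The function name, the `(topic_i, topic_j, ratio)` input format, and the fitting method are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def implied_sizes(comparisons, iterations=200):
    """Fit one size per topic from (topic_i, topic_j, ratio) triples,
    where each ratio is an answer of the form size_i / size_j.

    ASSUMPTION: least squares in log space; the report does not state the method.
    """
    # Each comparison constrains both topics: log(size_i) - log(size_j) ~ log(ratio).
    neighbours = defaultdict(list)
    for i, j, ratio in comparisons:
        log_ratio = math.log(ratio)
        neighbours[i].append((j, log_ratio))
        neighbours[j].append((i, -log_ratio))

    log_size = {topic: 0.0 for topic in neighbours}
    for _ in range(iterations):
        for topic, links in neighbours.items():
            # Coordinate-wise least-squares update: the average of what
            # each comparison implies for this topic's log size.
            log_size[topic] = sum(log_size[other] + lr for other, lr in links) / len(links)

    # Normalize so that the largest impact has size 100.
    largest = max(log_size.values())
    return {topic: 100.0 * math.exp(value - largest) for topic, value in log_size.items()}


# Perfectly consistent toy answers: A = 10 x B, C = 5 x A, C = 50 x B.
sizes = implied_sizes([("A", "B", 10.0), ("C", "A", 5.0), ("C", "B", 50.0)])
```

With consistent answers the fit reproduces them exactly (here C = 100, A = 20, B = 2). With inconsistent answers it returns the sizes that deviate least, in the squared log sense, from all comparisons at once.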

We then use the implied impact sizes to measure how well the individual pairwise comparisons align with these aggregate sizes as a gauge of internal consistency of an LLM's understanding of consequences.

For ease of interpretation, we calculate a metric that reflects what share of comparison results were close to the result expected from the implied impact sizes. This yields a performance metric for the internal consistency of the responses, expressed as a percentage.

(Technically, a comparison counts as "close" when the ratio given by the pairwise comparison and the ratio implied by the impact sizes, both transformed to log space, differ by less than 0.5 standard deviations.)
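A rough sketch of such a metric is shown below. The report does not say which distribution the standard deviation refers to, so the sketch assumes it is the spread of the implied log ratios across all comparisons. The function name and that choice are assumptions, not the actual CoCoCo definition.

```python
import math
import statistics


def consistency_share(comparisons, sizes, tolerance_sd=0.5):
    """Percentage of pairwise answers that are 'close' to the implied ratio.

    comparisons: (topic_i, topic_j, ratio) triples, ratio ~ size_i / size_j
    sizes: mapping from topic to implied impact size
    """
    observed = [math.log(ratio) for _, _, ratio in comparisons]
    expected = [math.log(sizes[i] / sizes[j]) for i, j, _ in comparisons]
    # ASSUMPTION: the "standard deviation" is taken over the implied log ratios.
    spread = statistics.pstdev(expected)
    close = sum(abs(o - e) < tolerance_sd * spread for o, e in zip(observed, expected))
    return 100.0 * close / len(comparisons)


toy_sizes = {"A": 20.0, "B": 2.0, "C": 100.0}
toy_answers = [
    ("A", "B", 10.0),  # matches the implied ratio 20 / 2
    ("C", "A", 5.0),   # matches the implied ratio 100 / 20
    ("C", "B", 50.0),  # matches the implied ratio 100 / 2
    ("B", "A", 1.0),   # implied ratio is 0.1, so this answer is far off
]
share = consistency_share(toy_answers, toy_sizes)
```

In this toy case three of the four answers fall within the tolerance, giving a share of 75 percent.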

Results

Of the tested models, Anthropic's Claude 3.7 Sonnet in thinking mode performed best, with OpenAI's o1 coming in second, and the Chinese DeepSeek R1 model coming in third.

 
Claude 3.7 Sonnet landed the top spot in quantifying consequences consistently
Figure 1: Performance of analyzed models in CoCoCo, which measures the ability of LLMs to understand impact in an internally consistent manner

We ran two evaluations for Claude 3.7 Sonnet, one with thinking disabled and one with a thinking budget of 2000 tokens, allowing the model to think before answering. Claude 3.7 Sonnet performed slightly better in thinking mode. The other thinking models were run with their default settings, meaning a medium reasoning effort for OpenAI's o3-mini and o1 models.

The selected models include frontier models from leading providers, along with all frontier models from OpenAI since GPT3.5 (turbo), published in early 2022. OpenAI's GPT4.5, which is in pre-release status, was not included for cost reasons. The total number of pairwise comparisons per model was 3,200.

In terms of price-performance ratio there is no clear winner, though newer models perform better overall. Note that the scale on the horizontal axis is logarithmic.

 
No clear winner in terms of cost-performance
Figure 2: Cost-performance of LLMs in CoCoCo, which measures the ability of LLMs to quantify impact in an internally consistent manner
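One way to make "no clear winner" concrete is to look for models that are not beaten on both cost and performance by any other model (a Pareto frontier). The sketch below applies this to the performance and cost figures reported in Table 2. The Pareto framing is our illustration here, not an analysis taken from the evaluation itself.

```python
# (model, performance in %, USD per 1M tasks), copied from Table 2.
RESULTS = [
    ("Claude 3.7 Sonnet (t)", 73.25, 19038),
    ("Claude 3.7 Sonnet", 72.47, 9116),
    ("OpenAI o1", 72.28, 96391),
    ("Deepseek R1-1776", 69.59, 4237),
    ("OpenAI o3-mini", 66.88, 8338),
    ("OpenAI GPT-4", 66.22, 25487),
    ("OpenAI GPT-4o", 66.03, 4072),
    ("Gemini 1.5 Pro", 64.72, 2204),
    ("Gemini 2.0 Flash", 61.03, 159),
    ("Mistral Large 2", 59.84, 2251),
    ("OpenAI GPT 3.5 Turbo", 30.47, 418),
]


def pareto_frontier(results):
    """Models for which no other model is both cheaper and better."""
    frontier = []
    best_performance = float("-inf")
    # Walk from cheapest to most expensive; a model is on the frontier
    # only if it beats every cheaper model on performance.
    for name, performance, _cost in sorted(results, key=lambda r: (r[2], -r[1])):
        if performance > best_performance:
            frontier.append(name)
            best_performance = performance
    return frontier


frontier = pareto_frontier(RESULTS)
```

Six of the eleven models end up on the frontier, from Gemini 2.0 Flash at the cheap end to Claude 3.7 Sonnet in thinking mode at the top, so which one is "best" depends on how much each extra percentage point is worth.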

Plotting the performance as a function of the (original) release date of each model shows how performance has improved since the original release of ChatGPT (GPT 3.5).

 
LLMs' ability to quantify consequences in terms of magnitude is improving rapidly, with still a long way to go
Figure 3: Performance of LLMs in CoCoCo over time. CoCoCo measures the ability of LLMs to quantify impact in an internally consistent manner.

(Strictly speaking OpenAI initially released ChatGPT based on GPT3.5 instead of GPT3.5-turbo, but they are very similar models in terms of performance. The original GPT3.5 model is no longer available).

Despite the progress, the eval remains challenging even for current frontier models. In an easier version of the eval, where we require only the ordering of the impacts to be consistent, the leading model achieves almost 90% consistency. This is not an easy feat, considering that we are comparing very diverse impacts ranging from GHG emissions to creating meaning & joy.
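A sketch of what an ordering-only check could look like follows. It counts a pairwise answer as consistent whenever it points in the same direction as the implied sizes, regardless of how large the stated ratio is. The function name and the handling of ties are assumptions, since the report does not describe the easier variant in detail.

```python
def ordering_consistency(comparisons, sizes):
    """Percentage of answers whose direction matches the implied ordering.

    comparisons: (topic_i, topic_j, ratio) triples, ratio ~ size_i / size_j
    sizes: mapping from topic to implied impact size
    """
    agree = 0
    for i, j, ratio in comparisons:
        says_i_larger = ratio > 1.0
        implied_i_larger = sizes[i] > sizes[j]
        # ASSUMPTION: ties (ratio == 1 or equal sizes) count as "not larger".
        if says_i_larger == implied_i_larger:
            agree += 1
    return 100.0 * agree / len(comparisons)


toy_sizes = {"A": 20.0, "B": 2.0, "C": 100.0}
toy_answers = [
    ("A", "B", 3.0),     # right direction, wrong magnitude
    ("C", "A", 0.5),     # wrong direction
    ("C", "B", 1000.0),  # right direction, wrong magnitude
    ("B", "C", 0.9),     # right direction, wrong magnitude
]
ordering_share = ordering_consistency(toy_answers, toy_sizes)
```

Here three of four answers agree on direction even though none of them gets the magnitude right, which illustrates why the ordering-only variant is the easier test.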

The full results are outlined in the table below:

 
| Model | Model type | Performance | $/1M tasks | Mean log err. | Release date |
| --- | --- | --- | --- | --- | --- |
| Claude 3.7 Sonnet (t) | Hybrid | 73.25 % | $19,038 | 0.37 | 2025-02-24 |
| Claude 3.7 Sonnet | Hybrid | 72.47 % | $9,116 | 0.38 | 2025-02-24 |
| OpenAI o1 | Reasoning | 72.28 % | $96,391 | 0.40 | 2024-09-12 |
| Deepseek R1-1776 | Reasoning | 69.59 % | $4,237 | 0.44 | 2025-01-20 |
| OpenAI o3-mini | Reasoning | 66.88 % | $8,338 | 0.46 | 2025-01-31 |
| OpenAI GPT-4 | Classic | 66.22 % | $25,487 | 0.46 | 2023-03-14 |
| OpenAI GPT-4o | Classic | 66.03 % | $4,072 | 0.45 | 2024-05-13 |
| Gemini 1.5 Pro | Classic | 64.72 % | $2,204 | 0.45 | 2025-02-15 |
| Gemini 2.0 Flash | Classic | 61.03 % | $159 | 0.49 | 2025-01-30 |
| Mistral Large 2 | Classic | 59.84 % | $2,251 | 0.49 | 2024-07-24 |
| OpenAI GPT 3.5 Turbo | Classic | 30.47 % | $418 | 0.83 | 2022-03-01 |

Table 2: Summary of the full results of LLMs' performance in CoCoCo, which measures the ability of LLMs to understand impact in an internally consistent manner. "(t)" denotes thinking mode.

Evaluating consistency with facts

While an LLM's internal consistency in understanding the world is one matter, its alignment with reality—including human values—is quite another. We are currently preparing additional evaluations in this area by drawing on our extensive database of 300 million research papers (converted into structured data), along with a curated set of 30,000 manually annotated papers.

In the meantime, we can highlight three key findings:

  1. Internal consistency of answers (unsurprisingly) correlates with factual accuracy
  2. LLMs naturally overemphasize impacts at the production stage while underestimating impacts during end-use
  3. LLMs often favor counterfactuals that cast companies' impacts in a more positive light and minimize environmental impacts in particular

We suspect these biases arise from the LLMs' training data, which include large volumes of corporate sustainability reports and similar content. Such documents typically adopt a positive spin and focus on readily measurable internal operations, with the latter leading to an overemphasis on production-related matters.

Consider, for instance, the impact of electric vehicles. The table below lists the sizes of the top impacts of electric vehicles as implied by answers provided by Claude 3.7 Sonnet to thousands of pairwise comparison tasks:

 
Claude 3.7 Sonnet thinks that the negative impacts of electric vehicles relate to greenhouse gas emissions, freshwater consumption, and use of scarce natural resources, but are outweighed by positive impacts on the environment.

| Upright impact category | Upright impact topic | Implied impact size |
| --- | --- | --- |
| GHG emissions | reduces greenhouse gas emissions | 100.0 |
| Biodiversity | preserves biodiversity by mitigating climate change | 52.8 |
| Non-GHG emissions | prevents pollution | 50.4 |
| Relationships | improves mobility for in-person interaction | 35.3 |
| Scarce natural resources | consumes freshwater | 34.6 |
| GHG emissions | creates carbon dioxide emissions | 34.6 |
| Knowledge infrastructure | accelerates innovation | 32.8 |
| Biodiversity | preserves biodiversity by reducing pollution | 26.7 |
| Societal infrastructure | contributes to societal transport infrastructure | 26.2 |
| Scarce natural resources | consumes non-renewable scarce natural resources | 19.5 |

Table 3: Implied impact magnitudes for electric vehicles, aggregated from thousands of pairwise comparisons run with Claude 3.7 Sonnet. Upright impact topics are standardized topics used within the Upright net impact framework to comprehensively cover all types of impacts. The freshwater consumption impact relates to the substantial freshwater consumption involved in mining the metals needed for batteries, such as lithium, cobalt and nickel. The numbers have been normalized such that the largest impact has size 100.

Aggregated into an Upright net impact profile, which summarizes both the positive and negative impacts in a set of 19 standardized, mutually exclusive and collectively exhaustive (MECE) impact categories, the results look as follows:

 
Claude 3.7 Sonnet emphasizes positive impacts while de-emphasizing negative impacts, especially on the environment
Figure 4: Net impact profile of electric vehicles extracted from pairwise comparisons run with Claude 3.7 Sonnet, reflecting the LLM's understanding of the product's impacts. The numbers reflect cents per dollar of revenue in terms of economic/social costs and benefits. The implied sizes from the pairwise comparisons have been translated to the more interpretable cents per dollar scale using the economic costs and benefits in the Upright net impact model as a frame of reference. That step is just a simple multiplication that does not affect the shape of the profile.
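The aggregation and rescaling described above can be sketched roughly as follows, using only the ten top impacts listed in Table 3 (so the resulting profile is partial). The positive/negative sign of each topic is not given in the table and is inferred here from the topic wording. The `scale` parameter stands in for the conversion factor to cents per dollar, whose actual value is not stated in this report.

```python
from collections import defaultdict

# (category, topic, implied size, sign) for the top impacts in Table 3.
# ASSUMPTION: the sign is not in the table; it is inferred from the topic wording.
EV_IMPACTS = [
    ("GHG emissions", "reduces greenhouse gas emissions", 100.0, +1),
    ("Biodiversity", "preserves biodiversity by mitigating climate change", 52.8, +1),
    ("Non-GHG emissions", "prevents pollution", 50.4, +1),
    ("Relationships", "improves mobility for in-person interaction", 35.3, +1),
    ("Scarce natural resources", "consumes freshwater", 34.6, -1),
    ("GHG emissions", "creates carbon dioxide emissions", 34.6, -1),
    ("Knowledge infrastructure", "accelerates innovation", 32.8, +1),
    ("Biodiversity", "preserves biodiversity by reducing pollution", 26.7, +1),
    ("Societal infrastructure", "contributes to societal transport infrastructure", 26.2, +1),
    ("Scarce natural resources", "consumes non-renewable scarce natural resources", 19.5, -1),
]


def net_impact_profile(impacts, scale=1.0):
    """Sum positive and negative impact sizes per category.

    scale: hypothetical multiplier converting implied sizes to another unit
    (for example cents per dollar); it rescales every bar by the same factor.
    """
    profile = defaultdict(lambda: {"positive": 0.0, "negative": 0.0})
    for category, _topic, size, sign in impacts:
        side = "positive" if sign > 0 else "negative"
        profile[category][side] += scale * size
    return dict(profile)


profile = net_impact_profile(EV_IMPACTS)
rescaled = net_impact_profile(EV_IMPACTS, scale=0.5)  # 0.5 is an arbitrary illustration
```

Because every value is multiplied by the same factor, the ratios between categories are identical in `profile` and `rescaled`, which is what it means for the multiplication not to affect the shape of the profile.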

Contrast this with results from the Upright net impact model for the same product:

 
Claude has a more positive view on the environmental impact of electric vehicles than the science-driven Upright net impact model.
Figure 5: Net impact profile of electric vehicles extracted from pairwise comparisons run with Claude 3.7 Sonnet, compared to results of the Upright net impact model, which reflects a synthesis of scientific knowledge of the product's impacts (the figure shows one series for Claude 3.7 Sonnet and one for the Upright net impact model). The negative impact in the knowledge dimension recognized by the Upright net impact model relates to the opportunity cost of scarce human capital. The numbers reflect cents per dollar of revenue in terms of economic/social costs and benefits.

The Upright net impact model estimates the product's impacts based on scientific research, using LLMs only to convert unstructured data into structured form, rather than treating them as a source of truth. While the Upright net impact model should not be regarded as ground truth, it provides a relevant point of comparison for Claude's results.

Both the Upright model and Claude acknowledge the product's positive contributions to societal infrastructure and the reduction of greenhouse gas (GHG) emissions. Their main difference lies in Claude's assessment of the product's negative environmental impacts, which it deems much smaller than the Upright model does.

According to the Upright net impact model, electric vehicles both create and reduce GHG emissions, and are not unambiguously positive or negative for the environment. Instead, their contributions to societal infrastructure (e.g. through electric buses), employment, and tax revenue tip the balance into net positive territory. In contrast, Claude sees the product as having a clearly beneficial environmental impact. The differences largely come down to emphasizing different counterfactuals: Claude emphasizes that electric vehicles substitute for gasoline-powered vehicles, while the Upright model weighs more strongly the fact that electric vehicles still create GHG emissions in the process.

The same tendency towards de-emphasizing negative environmental impacts shows in Claude's understanding of the impact of large language models, with greenhouse gas emissions being only 43rd on the list of their top impacts (further down the list than the top rows shown in the table below):

 
Claude consistently "thinks" that creating existential risks is the top negative impact of large language models, while being outweighed by positive impacts related to distributing knowledge, and environmental impacts being small.

| Upright impact category | Upright impact topic | Implied impact size |
| --- | --- | --- |
| Distributing knowledge | distributes information | 100.0 |
| Societal stability | creates existential risk | 33.9 |
| Knowledge infrastructure | enables utilisation of knowledge | 32.1 |
| Meaning & joy | facilitates learning | 23.5 |
| Distributing knowledge | enhances searchability of information | 22.8 |
| Distributing knowledge | makes complex information understandable | 21.9 |
| Knowledge infrastructure | provides tools for knowledge sharing | 21.8 |
| Distributing knowledge | distributes information without transparency on sources | 20.3 |
| Knowledge infrastructure | enables management of knowledge | 19.2 |
| Knowledge infrastructure | accelerates innovation | 17.1 |
| Knowledge infrastructure | preserves knowledge | 16.6 |
| Knowledge infrastructure | facilitates research | 15.1 |

Table 4: Implied impact magnitudes for top impacts of large language models, aggregated from thousands of pairwise comparisons run with Claude 3.7 Sonnet. Upright impact topics are standardized topics used within the Upright net impact framework to comprehensively cover all types of impacts.

It is important to note that, unlike popular reports of LLMs giving surprising answers to individual prompts, these results reflect what the LLMs (in this case Claude) consistently "think" about the magnitudes of these impacts, as the results reflect thousands of pairwise comparisons.

Overall, the results suggest that LLMs possess much of the underlying knowledge required to comprehensively assess the varied impacts of well-known products and services (and likely also other things, such as actions), yet still exhibit significant biases when prompted neutrally.

The other alignment problem

Growing concern about existential risks from AI has drawn the attention of experts across business, technology, and policy. Hundreds—including Bill Gates, Elon Musk, Sam Altman, and OpenAI co-founder Ilya Sutskever—have signed the Center for AI Safety's statement warning that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

While experts articulate a range of reasons to worry about AI alignment, the fundamental reason to worry seems to be that AI systems are (1) complex systems (2) with emergent behavior (3) that we are likely not able to control and (4) which have the potential to affect society at a very large scale.

Most of the worries relate to future AI systems, not current ones. However, what seems to be missing from the picture is that we are already "running" a complex system with emergent behavior, which no one really controls, and which is affecting society at a very large scale: the market economy.

Upright's mission, rephrased in AI lingo, is to solve the alignment problem of the market economy, i.e. to drive the economy towards acting in a way that is consistent with human values and goals.

The two big alignment problems, i.e. (1) aligning AI systems and (2) aligning the economic system, are largely interwoven. Rapid advances in AI are poised to exacerbate the economic system's alignment problem, while market incentives drive ever-faster AI development and adoption. The two problems appear to be converging, and it may not be possible to solve one without the other.

Want to know more?

If your work touches on AI alignment and you'd like further information about our evaluations, please contact us at alignment@uprightproject.com. We're also happy to discuss our research with scholars, bloggers, and journalists. If you're interested in licensing our evaluations to support alignment or capabilities work, feel free to reach out at the same address.

We're continuously seeking exceptional individuals for engineering roles. If you're excited to tackle complex software and data challenges related to producing better impact data to advance alignment, visit our jobs page! We are also hiring full-stack developers to work on our various tools.

Notices

Upright has a U.S. patent pending on key techniques underlying this tech report.

Citing

To reference this tech report, you may use the citation information below:

APA

Ojala, J. (2025). Evaluating the ability of LLMs to consistently quantify a wide range of consequences (Technical Report No. UPTR-202501). Upright Oy. https://www.uprightproject.com/blog/evaluating-llms
