Measuring impact

Evaluating the ability of LLMs to quantify consequences

As AI systems grow more powerful and autonomous, their ability to weigh real-world consequences becomes crucial for achieving alignment with human goals. In this tech report, we publish the results of our evaluations of the ability of leading large language models (LLMs) to consistently quantify and compare a wide range of impacts, from greenhouse gas emissions and resource use to meaning and joy. The results reveal not only how far LLMs have come in grasping complex cause-and-effect relationships, but also where they continue to stumble, often due to inherent biases in their training data.

Juho Ojala

CTO & Co-Founder, Upright

Published Apr 8, 2025

Introduction

AI alignment—ensuring that AI systems act in harmony with human values and goals—is widely regarded as the foremost challenge of our era.

Current AI systems handle alignment deontologically: they judge actions based on rules, duties, and principles rather than consequences. This approach, however, is expected to fall short as AI systems grow more autonomous and surpass human intelligence in critical domains.

Replacing the deontological approach with a consequentialist approach will require AI systems to be able to accurately evaluate and compare the scope of potential outcomes of any action.

Since 2017, Upright has been working on consequentialist alignment of the private sector. To that end, Upright produces data on the impact of companies at scale, comprehensively considering all relevant positive and negative impacts that a company may have. Upright's public database provides data on more than 30,000 companies and draws upon hundreds of millions of scientific articles, statistical databases, and company-reported information.

Starting in 2018 with Google's BERT, the first publicly available large language model, Upright has been using large language models to process this vast amount of information. Doing this in a way that does not involve naively trusting LLMs requires effective automated benchmarks that evaluate the LLMs' relevant capabilities. One such evaluation tests the ability of large language models to assess and compare the magnitude of different impacts. Recognizing that this capability is key not only for our own work but also for achieving alignment of future AI systems, we have chosen to release the results of this evaluation for the public good.

Evaluating the ability of LLMs to consistently quantify consequences

Imagine being asked to gauge whether the greenhouse gas emissions from steel production are more or less significant than its contribution to societal infrastructure, or whether the delight we derive from chocolate outweighs its negative health impacts.

Now, extend this thought experiment to thousands of such comparisons across hundreds of different impacts—and you are required to answer in a way that remains logically consistent. For example, if you conclude that impact A is ten times greater than B, and that C is five times greater than A, you must also conclude that C is fifty times greater than B—all without remembering your previous answers.
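The consistency requirement in the example above can be written down as a small check. This is only an illustrative sketch: the function names and the tolerance are assumptions made here, not part of the evaluation itself.

```python
import math


def implied_ratio(ratio_a_to_b: float, ratio_c_to_a: float) -> float:
    """If A = ratio_a_to_b * B and C = ratio_c_to_a * A, return the implied C / B."""
    return ratio_c_to_a * ratio_a_to_b


def is_consistent(ratio_a_to_b: float, ratio_c_to_a: float,
                  ratio_c_to_b: float, tol: float = 1e-9) -> bool:
    """Check the triple in log space, where consistent ratios simply add up."""
    expected = math.log(ratio_c_to_a) + math.log(ratio_a_to_b)
    return math.isclose(math.log(ratio_c_to_b), expected, abs_tol=tol)


# The example from the text: A is 10x B and C is 5x A, so C must be 50x B.
print(implied_ratio(10, 5))      # 50
print(is_consistent(10, 5, 50))  # True
print(is_consistent(10, 5, 20))  # False
```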

Performing well in this task requires (1) sophisticated and consistent counterfactual reasoning, (2) balanced and nuanced causal attribution, (3) a coherent value system, and (4) robust handling of uncertainty and vagueness.

It also requires accepting that such comparisons need to be made. This is not a problem for LLMs that are trained to have a can-do attitude, but humans tend to struggle with this. We have written on this topic separately in our FAQ (see the section about comparing apples to oranges).

Within our CoCoCo (Consequence Magnitude Comparison Consistency) evaluation, large language models must perform thousands of such comparisons, generating an aggregate set of implied impact sizes for each impact topic.
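How pairwise comparisons turn into one set of implied sizes is not spelled out in this report. One common way to do it, shown here purely as an assumption, is a least-squares fit in log space: each topic gets a log size, and the fit minimizes the squared gap between every stated log ratio and the difference of the two log sizes. The function name, the iterative solver, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def implied_sizes(comparisons, iterations=200):
    """Fit one size per topic from (a, b, ratio) triples meaning 'a is ratio times b'.

    ASSUMPTION: log-space least squares, solved by repeated averaging
    (Gauss-Seidel). The comparison graph must be connected.
    """
    neighbors = defaultdict(list)  # topic -> [(other topic, log(topic / other))]
    for a, b, ratio in comparisons:
        log_ratio = math.log(ratio)
        neighbors[a].append((b, log_ratio))
        neighbors[b].append((a, -log_ratio))

    log_size = {topic: 0.0 for topic in neighbors}
    for _ in range(iterations):
        for topic, edges in neighbors.items():
            # Each comparison suggests a log size for this topic; take the mean.
            log_size[topic] = sum(log_size[o] + d for o, d in edges) / len(edges)

    # Normalize so that the largest impact has size 100, as in the tables.
    top = max(log_size.values())
    return {topic: 100.0 * math.exp(v - top) for topic, v in log_size.items()}


# Hypothetical, perfectly consistent toy data (not taken from the evaluation).
toy = [("A", "B", 10.0), ("C", "A", 5.0), ("C", "B", 50.0)]
sizes = implied_sizes(toy)
print({topic: round(size, 3) for topic, size in sizes.items()})
```

With consistent input the fit recovers the underlying sizes exactly (up to the normalization); with inconsistent input it returns the best compromise, which is what makes the per-comparison deviations measurable.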

For example, the table below lists the implied impact sizes for social media platforms, produced from thousands of pairwise comparisons done by Claude 3.7 Sonnet:

Claude "thinks" that the top positive impacts of social media relate to connecting people and distributing knowledge, while its top negative impacts relate to unreliable information and privacy

Upright impact category	Upright impact topic	Implied impact size
Relationships	facilitates communication	100.0
Relationships	connects individuals around shared interests	81.9
Relationships	enables shared activities	67.6
Distributing knowledge	distributes knowledge	60.0
Knowledge infrastructure	enables distribution of knowledge	52.9
Distributing knowledge	distributes information without transparency on sources	51.7
Equality & human rights	exploits personal data	49.9
Societal stability	enables propaganda	48.5
Equality & human rights	enables surveillance	46.2
Knowledge infrastructure	provides tools for knowledge sharing	46.0

Table 1: Implied impact magnitudes for social media platforms, aggregated from pairwise comparisons run with Claude 3.7 Sonnet. Upright impact topics are standardized topics used within the Upright net impact framework to comprehensively cover all types of impacts. The numbers have been normalized such that the largest impact has size 100.

We then measure how well the individual pairwise comparisons align with these implied impact sizes, as a gauge of the internal consistency of an LLM's understanding of consequences.

For ease of interpretation, we calculate a metric that reflects what share of comparison results were close to the result expected from the implied impact sizes. This yields a performance metric for the internal consistency of the responses, expressed as a percentage.

(Technically, a comparison counts as "close" when the log-transformed ratio given by the pairwise comparison and the log-transformed ratio implied by the impact sizes differ by less than 0.5 standard deviations.)
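A minimal sketch of such a share-of-close-comparisons metric follows. The text does not say which standard deviation is used, so this sketch assumes the population standard deviation of the implied log ratios; the function name and the toy data are likewise hypothetical.

```python
import math
from statistics import pstdev


def consistency_share(comparisons, sizes, threshold_sd=0.5):
    """Share of (a, b, ratio) comparisons that are 'close' to the implied ratio."""
    observed = [math.log(ratio) for _, _, ratio in comparisons]
    implied = [math.log(sizes[a] / sizes[b]) for a, b, _ in comparisons]
    # ASSUMPTION: the spread of the implied log ratios sets the scale.
    sd = pstdev(implied)
    close = sum(abs(o - i) < threshold_sd * sd for o, i in zip(observed, implied))
    return close / len(comparisons)


# Hypothetical implied sizes and answers; the last answer is far off (2x vs 50x).
sizes = {"A": 20.0, "B": 2.0, "C": 100.0}
answers = [("A", "B", 10.0), ("C", "A", 5.0), ("C", "B", 50.0), ("C", "B", 2.0)]
print(consistency_share(answers, sizes))  # 0.75
```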

Results

Of the tested models, Anthropic's Claude 3.7 Sonnet in thinking mode performed best, with OpenAI's o1 coming in second, and the Chinese DeepSeek R1 model coming in third.

Claude 3.7 Sonnet landed the top spot in quantifying consequences consistently

Figure 1: Performance of analyzed models in CoCoCo, which measures the ability of LLMs to understand impact in an internally consistent manner

We ran two evaluations for Claude 3.7 Sonnet, one with thinking disabled, and one with a thinking budget of 2000 tokens, allowing the model to think before answering. Claude 3.7 performed slightly better in the thinking mode. The other thinking models were run with the default settings, meaning a medium reasoning effort for OpenAI's o3-mini and o1 models.

Models selected include frontier models from leading providers, along with all frontier models from OpenAI since GPT3.5 (turbo), published in early 2022. OpenAI's GPT4.5, which is in pre-release status, was not included for cost reasons. Each model was run on 3,200 pairwise comparisons.

In terms of price-performance ratio there is no clear winner, though newer models perform better overall. Note that the scale on the horizontal axis is logarithmic.

No clear winner in terms of cost-performance

Figure 2: Cost-performance of LLMs in CoCoCo, which measures the ability of LLMs to quantify impact in an internally consistent manner

Plotting the performance as a function of the (original) release date of each model shows how performance has improved since the original release of ChatGPT (GPT 3.5).

LLMs' ability to quantify consequences in terms of magnitude is improving rapidly, with still a long way to go

Figure 3: Performance of LLMs in CoCoCo over time. CoCoCo measures the ability of LLMs to quantify impact in an internally consistent manner.

(Strictly speaking, OpenAI initially released ChatGPT based on GPT3.5 rather than GPT3.5-turbo, but the two models perform very similarly. The original GPT3.5 model is no longer available.)

Despite the progress, the eval remains challenging even for current frontier models. In an easier version of the eval, where we require only the ordering of the impacts to be consistent, the leading model achieves almost 90% consistency. This is not an easy feat, considering that we are comparing very diverse impacts ranging from GHG emissions to creating meaning & joy.
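The easier, ordering-only variant can be sketched as follows. The report does not give its exact definition, so this assumes the simplest reading: a comparison is consistent when it names the same impact as the larger one as the implied sizes do. The function name and toy data are hypothetical.

```python
def ordering_consistency(comparisons, sizes):
    """Share of (a, b, ratio) comparisons whose direction matches the implied sizes."""
    agree = 0
    for a, b, ratio in comparisons:
        says_a_larger = ratio > 1.0          # what the single comparison claims
        implies_a_larger = sizes[a] > sizes[b]  # what the aggregate sizes imply
        agree += says_a_larger == implies_a_larger
    return agree / len(comparisons)


# Hypothetical data: ("C", "B", 2.0) is far off in magnitude (implied 50x)
# but still gets the ordering right; ("B", "A", 3.0) gets it wrong.
sizes = {"A": 20.0, "B": 2.0, "C": 100.0}
answers = [("A", "B", 10.0), ("C", "A", 5.0), ("C", "B", 50.0),
           ("C", "B", 2.0), ("B", "A", 3.0)]
print(ordering_consistency(answers, sizes))  # 0.8
```

The toy data illustrates why this variant is easier: an answer can miss the magnitude badly and still count as consistent, as long as it picks the right direction.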

The full results are outlined in the table below:

Model	Model type	Performance	$/1M tasks	Mean log err.	Release date
Claude 3.7 Sonnet (t)	Hybrid	73.25 %	$19,038	0.37	2025-02-24
Claude 3.7 Sonnet	Hybrid	72.47 %	$9,116	0.38	2025-02-24
OpenAI o1	Reasoning	72.28 %	$96,391	0.40	2024-09-12
Deepseek R1-1776	Reasoning	69.59 %	$4,237	0.44	2025-01-20
OpenAI o3-mini	Reasoning	66.88 %	$8,338	0.46	2025-01-31
OpenAI GPT-4	Classic	66.22 %	$25,487	0.46	2023-03-14
OpenAI GPT-4o	Classic	66.03 %	$4,072	0.45	2024-05-13
Gemini 1.5 Pro	Classic	64.72 %	$2,204	0.45	2024-02-15
Gemini 2.0 Flash	Classic	61.03 %	$159	0.49	2025-01-30
Mistral Large 2	Classic	59.84 %	$2,251	0.49	2024-07-24
OpenAI GPT 3.5 Turbo	Classic	30.47 %	$418	0.83	2022-03-01

Table 2: Summary of full results of LLMs' performance in CoCoCo, which measures the ability of LLMs to understand impact in an internally consistent manner.
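One way to make the "no clear winner" observation on cost-performance concrete is to compute which models are not dominated, meaning that no other model is both cheaper and better. The sketch below uses the performance and cost columns of Table 2 as given; the Pareto-frontier framing and the function name are additions made here for illustration, not part of the original analysis.

```python
# (model, performance in %, cost in $ per 1M tasks), copied from Table 2.
RESULTS = [
    ("Claude 3.7 Sonnet (t)", 73.25, 19038),
    ("Claude 3.7 Sonnet", 72.47, 9116),
    ("OpenAI o1", 72.28, 96391),
    ("Deepseek R1-1776", 69.59, 4237),
    ("OpenAI o3-mini", 66.88, 8338),
    ("OpenAI GPT-4", 66.22, 25487),
    ("OpenAI GPT-4o", 66.03, 4072),
    ("Gemini 1.5 Pro", 64.72, 2204),
    ("Gemini 2.0 Flash", 61.03, 159),
    ("Mistral Large 2", 59.84, 2251),
    ("OpenAI GPT 3.5 Turbo", 30.47, 418),
]


def pareto_frontier(results):
    """Return the models for which no cheaper model performs at least as well."""
    frontier = []
    best_so_far = float("-inf")
    # Walk from cheapest to most expensive; keep each model that sets a new best.
    for name, performance, cost in sorted(results, key=lambda r: (r[2], -r[1])):
        if performance > best_so_far:
            frontier.append(name)
            best_so_far = performance
    return frontier


for name in pareto_frontier(RESULTS):
    print(name)
```

On these numbers, six of the eleven models sit on the frontier (Gemini 2.0 Flash, Gemini 1.5 Pro, GPT-4o, Deepseek R1-1776, and both Claude 3.7 Sonnet runs), which is consistent with the absence of a single price-performance winner.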

Evaluating consistency with facts

While an LLM's internal consistency in understanding the world is one matter, its alignment with reality—including human values—is quite another. We are currently preparing additional evaluations in this area by drawing on our extensive database of 300 million research papers (converted into structured data), along with a curated set of 30,000 manually annotated papers.

In the meantime, we can highlight three key findings:

Internal consistency of answers (unsurprisingly) correlates with factual accuracy
LLMs naturally overemphasize impacts at the production stage while underestimating impacts during end-use
LLMs often favor counterfactuals that cast companies' impacts in a more positive light and minimize environmental impacts in particular

We suspect these biases arise from the LLMs' training data, which include large volumes of corporate sustainability reports and similar content. Such documents typically adopt a positive spin and focus on readily measurable internal operations, with the latter leading to an overemphasis on production-related matters.

Consider, for instance, the impact of electric vehicles. The table below lists the sizes of the top impacts of electric vehicles as implied by answers provided by Claude 3.7 Sonnet to thousands of pairwise comparison tasks:

Claude 3.7 Sonnet thinks that the negative impacts of electric vehicles relate to greenhouse gas emissions, freshwater consumption, and use of scarce natural resources, but are outweighed by positive impacts on the environment.

Upright impact category	Upright impact topic	Implied impact size
GHG emissions	reduces greenhouse gas emissions	100.0
Biodiversity	preserves biodiversity by mitigating climate change	52.8
Non-GHG emissions	prevents pollution	50.4
Relationships	improves mobility for in-person interaction	35.3
Scarce natural resources	consumes freshwater	34.6
GHG emissions	creates carbon dioxide emissions	34.6
Knowledge infrastructure	accelerates innovation	32.8
Biodiversity	preserves biodiversity by reducing pollution	26.7
Societal infrastructure	contributes to societal transport infrastructure	26.2
Scarce natural resources	consumes non-renewable scarce natural resources	19.5

Table 3: Implied impact magnitudes for electric vehicles, aggregated from thousands of pairwise comparisons run with Claude 3.7 Sonnet. Upright impact topics are standardized topics used within the Upright net impact framework to comprehensively cover all types of impacts. The freshwater consumption impact relates to the substantial freshwater consumption involved in mining the metals needed for batteries, such as lithium, cobalt and nickel. The numbers have been normalized such that the largest impact has size 100.

Aggregated into an Upright net impact profile, which summarizes both the positive and negative impacts in a set of 19 standardized, mutually exclusive and collectively exhaustive (MECE) impact categories, the results look as follows:

Claude 3.7 Sonnet emphasizes positive impacts while de-emphasizing negative impacts, especially on the environment

Figure 4: Net impact profile of electric vehicles extracted from pairwise comparisons run with Claude 3.7 Sonnet, reflecting the LLM's understanding of the product's impacts. The numbers reflect cents per dollar of revenue in terms of economic/social costs and benefits. The implied sizes from the pairwise comparisons have been translated to the more interpretable cents per dollar scale using the economic costs and benefits in the Upright net impact model as a frame of reference. That step is just a simple multiplication that does not affect the shape of the profile.
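The aggregation into categories and the rescaling mentioned in the caption can be illustrated with the ten rows of Table 3. This is a rough sketch under several assumptions: it uses only those ten rows rather than the full list of impacts, the positive or negative sign of each topic is assigned here by reading its wording, and the scale factor of 0.1 is a made-up placeholder, since the actual conversion factor is not given.

```python
from collections import defaultdict

# (category, topic, implied size, sign) from Table 3.
# ASSUMPTION: signs are assigned by hand from the topic wording.
ROWS = [
    ("GHG emissions", "reduces greenhouse gas emissions", 100.0, +1),
    ("Biodiversity", "preserves biodiversity by mitigating climate change", 52.8, +1),
    ("Non-GHG emissions", "prevents pollution", 50.4, +1),
    ("Relationships", "improves mobility for in-person interaction", 35.3, +1),
    ("Scarce natural resources", "consumes freshwater", 34.6, -1),
    ("GHG emissions", "creates carbon dioxide emissions", 34.6, -1),
    ("Knowledge infrastructure", "accelerates innovation", 32.8, +1),
    ("Biodiversity", "preserves biodiversity by reducing pollution", 26.7, +1),
    ("Societal infrastructure", "contributes to societal transport infrastructure", 26.2, +1),
    ("Scarce natural resources", "consumes non-renewable scarce natural resources", 19.5, -1),
]


def net_profile(rows, scale=1.0):
    """Sum positive and negative impact sizes per category, times a constant scale."""
    profile = defaultdict(lambda: {"positive": 0.0, "negative": 0.0})
    for category, _topic, size, sign in rows:
        side = "positive" if sign > 0 else "negative"
        profile[category][side] += scale * size
    return dict(profile)


raw = net_profile(ROWS)
scaled = net_profile(ROWS, scale=0.1)  # hypothetical conversion factor
for category, sides in scaled.items():
    print(f"{category}: +{sides['positive']:.2f} / -{sides['negative']:.2f}")
```

Because every value is multiplied by the same constant, the ratios between categories are identical in `raw` and `scaled`, which is what "does not affect the shape of the profile" means in practice.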

Contrast this with results from the Upright net impact model for the same product:

Claude has a more positive view on the environmental impact of electric vehicles than the science-driven Upright net impact model.

Panels: Claude 3.7 Sonnet; Upright net impact model

Figure 5: Net impact profile of electric vehicles extracted from pairwise comparisons run with Claude 3.7 Sonnet, compared to results of the Upright net impact model, which reflects a synthesis of scientific knowledge of the product's impacts. The negative impact in the knowledge dimension recognized by the Upright net impact model relates to the opportunity cost of scarce human capital. The numbers reflect cents per dollar of revenue in terms of economic/social costs and benefits.

The Upright net impact model estimates the product's impacts based on scientific research, using LLMs only to convert unstructured data into structured form, rather than treating them as a source of truth. While the Upright net impact model should not be regarded as ground truth, it provides a relevant point of comparison for Claude's results.

Both the Upright model and Claude acknowledge the product's positive contributions to societal infrastructure and the reduction of greenhouse gas (GHG) emissions. Their main difference lies in Claude's assessment of the product's negative environmental impacts, which it deems much smaller than the Upright model does.

According to the Upright net impact model, electric vehicles both create and reduce GHG emissions and are not unambiguously positive or negative for the environment. Instead, their contributions to societal infrastructure (e.g. through electric buses), employment, and tax revenue tip the balance into net positive territory. In contrast, Claude sees the product as having a clearly beneficial environmental impact. The differences largely come down to emphasizing different counterfactuals: Claude emphasizes that electric vehicles substitute for gasoline-powered vehicles, while the Upright model gives more weight to the fact that electric vehicles still create GHG emissions in the process.

The same tendency towards de-emphasizing negative environmental impacts shows in Claude's understanding of the impact of large language models, with greenhouse gas emissions being only 43rd on the list of their top impacts:

Claude consistently "thinks" that creating existential risk is the top negative impact of large language models, that it is outweighed by positive impacts related to distributing knowledge, and that environmental impacts are small.

Upright impact category	Upright impact topic	Implied impact size
Distributing knowledge	distributes information	100.0
Societal stability	creates existential risk	33.9
Knowledge infrastructure	enables utilisation of knowledge	32.1
Meaning & joy	facilitates learning	23.5
Distributing knowledge	enhances searchability of information	22.8
Distributing knowledge	makes complex information understandable	21.9
Knowledge infrastructure	provides tools for knowledge sharing	21.8
Distributing knowledge	distributes information without transparency on sources	20.3
Knowledge infrastructure	enables management of knowledge	19.2
Knowledge infrastructure	accelerates innovation	17.1
Knowledge infrastructure	preserves knowledge	16.6
Knowledge infrastructure	facilitates research	15.1

Table 4: Implied impact magnitudes for top impacts of large language models, aggregated from thousands of pairwise comparisons run with Claude 3.7 Sonnet. Upright impact topics are standardized topics used within the Upright net impact framework to comprehensively cover all types of impacts.

It is important to note that, unlike popular reports of LLMs giving surprising answers to individual prompts, these results reflect what the LLMs (in this case Claude) consistently "think" about the magnitudes of these impacts, as they are based on thousands of pairwise comparisons.

Overall, the results suggest that LLMs possess much of the underlying knowledge required to comprehensively assess the varied impacts of well-known products and services (and likely also other things, such as actions), yet still exhibit significant biases when prompted neutrally.

The other alignment problem

Growing concern about existential risks from AI has drawn the attention of experts across business, technology, and policy. Hundreds—including Bill Gates, Elon Musk, Sam Altman, and OpenAI co-founder Ilya Sutskever—have signed the Center for AI Safety's statement warning that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

While experts articulate a range of reasons to worry about AI alignment, the fundamental reason to worry seems to be that AI systems are (1) complex systems (2) with emergent behavior (3) that we are likely not able to control and (4) which have the potential to affect society at a very large scale.

Most of the worries relate to future AI systems, not current ones. However, what seems to be missing from the picture is that we are already "running" a complex system with emergent behavior, which no one really controls, and which is affecting society at a very large scale: the market economy.

Upright's mission, rephrased in AI lingo, is to solve the alignment problem of the market economy, i.e. to drive the economy towards acting in a way that is consistent with human values and goals.

The two big alignment problems, (1) aligning AI systems and (2) aligning the economic system, are largely interwoven. Rapid advances in AI are poised to exacerbate the economic system's alignment problem, while market incentives drive ever-faster AI development and adoption. The two problems seem to be converging, to the point where one cannot be solved without the other.

Want to know more?

If your work touches on AI alignment and you'd like further information about our evaluations, please contact us at alignment@uprightproject.com. We're also happy to discuss our research with scholars, bloggers, and journalists. If you're interested in licensing our evaluations to support alignment or capabilities work, feel free to reach out at the same address.

We're continuously seeking exceptional individuals for engineering roles. If you're excited to tackle complex software and data challenges related to producing better impact data to advance alignment, visit our jobs page! We are also hiring full-stack developers to work on our various tools.

Notices

Upright has a U.S. patent pending on key techniques underlying this tech report.

Citing

To reference this tech report, you may use the citation information below:

APA

Ojala, J. (2025). Evaluating the ability of LLMs to consistently quantify a wide range of consequences (Technical Report No. UPTR-202501). Upright Oy. https://www.uprightproject.com/blog/evaluating-llms

BibTeX

@techreport{ojala2025evaluating,
  author = {Ojala, Juho},
  title = {Evaluating the ability of {LLMs} to consistently quantify a wide range of consequences},
  institution = {Upright Oy},
  year = {2025},
  number = {UPTR-202501},
  url = {https://www.uprightproject.com/blog/evaluating-llms}
}
