OpenAI’s GDPval Benchmark: Evaluating AI’s Potential in Professional Industries

OpenAI has introduced a benchmark called GDPval that tests how its AI models perform against human professionals across a range of industries and occupations. The benchmark is part of OpenAI’s mission to develop artificial general intelligence (AGI) that can perform economically valuable work at a human level.

The GDPval benchmark encompasses nine key industries that significantly contribute to the United States’ gross domestic product, including sectors like healthcare, finance, manufacturing, and government. Within these industries, the benchmark evaluates AI performance in 44 different occupations, ranging from software engineering to nursing and journalism.

For the initial version, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with reports written by humans and judge which was better. For instance, investment bankers were tasked with creating a competitor landscape for the last-mile delivery industry, and their work was then compared with AI-generated reports on the same task. According to the results, OpenAI’s enhanced GPT-5 model was rated on par with or better than industry experts in 40.6% of cases, while Anthropic’s Claude Opus 4.1 reached 49% in similar evaluations.
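To make the headline figures concrete, the following is a minimal sketch of how a "better than or on par with" rate could be computed from grader verdicts. The verdict labels, the sample data, and the function name win_or_tie_rate are hypothetical illustrations, not OpenAI's actual scoring code or data.

```python
from collections import Counter

# Hypothetical grader verdicts for one model (illustrative data only).
# Each entry records whether the professional grader preferred the AI report
# ("win"), rated the two as comparable ("tie"), or preferred the human
# report ("loss").
verdicts = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie"]


def win_or_tie_rate(verdicts):
    """Return the fraction of comparisons the AI report won or tied."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    return (counts["win"] + counts["tie"]) / len(verdicts)


print(f"{win_or_tie_rate(verdicts):.1%}")  # prints 50.0%
```

Under this reading, a figure such as 40.6% would mean that graders rated the model's report as equal to or better than the human report in roughly two of every five comparisons.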

Despite these promising results, OpenAI acknowledges the limitations of GDPval-v0: it currently covers only report generation, a small subset of the tasks professionals perform. The company plans to develop more comprehensive tests that cover a broader range of industries and interactive workflows, providing a more robust measure of AI’s capabilities.

The progress seen in GDPval suggests that AI models could assist professionals by taking on some of their tasks, freeing them to focus on more meaningful, higher-value work. As AI capabilities continue to improve, these tools could help raise productivity and innovation across sectors.

OpenAI’s benchmarking efforts are part of a broader push within Silicon Valley to measure the state of the art in AI. Benchmarks such as AIME 2025 and GPQA Diamond already test specific skills; GDPval instead seeks to measure AI’s proficiency in real-world tasks, potentially setting a new standard for AI evaluation in professional settings.

As OpenAI continues to refine GDPval, the benchmark could play a pivotal role in demonstrating the value of AI models across diverse industries, potentially redefining the future of work as we know it.