Stanford: Most Foundation Models Don’t Comply with EU’s AI Act
Among the findings: A surprisingly poor showing for a respected large language model
At a Glance
- Stanford HAI assessed 10 major foundation models and found that they "largely do not" comply with the EU AI Act.
- Top scorers were Hugging Face's BLOOM, EleutherAI's GPT-NeoX and Google's PaLM 2. Near the bottom was Anthropic's Claude.
- Stanford recommends several policy changes for lawmakers and model improvements for providers.
An investigation of 10 major foundation models shows that they “largely do not” comply with the EU’s AI Act, according to Stanford University’s renowned Institute for Human-Centered Artificial Intelligence (HAI).
The highest-scoring foundation model was BLOOM, the open source model from Hugging Face. Released in July 2022, BLOOM is a large multilingual model with up to 176 billion parameters designed to be general purpose.
It was followed by EleutherAI’s GPT-NeoX and Google’s PaLM 2. OpenAI’s GPT-4 was a close fourth, and Cohere’s Command rounded out the top five. The highest possible score is 48.
The lowest-scoring model was Luminous from German AI startup Aleph Alpha. Surprisingly, the second worst-performing model was Claude, the Anthropic-developed AI designed to generate safer responses. Third worst was Jurassic-2 from AI21.
Claude's results were typical of restricted or closed models, according to Stanford researchers. Models about which little has been disclosed, such as Claude, Google’s PaLM 2 and OpenAI’s GPT-4, were difficult to assess on data sources and compute. GPT-4 scored 25 and PaLM 2 scored 27, although, like Claude, both ranked low on the criteria covering data sources and governance.
Stanford researchers took the EU AI Act’s 22 requirements and chose 12 that can be “meaningfully evaluated” using public information: data sources, data governance, copyrighted data, compute, energy, capabilities/limitations, risks/mitigations, evaluations, testing, machine-generated content, member states and downstream documentation.
These 12 metrics are assessed on a 4-point scale, with 1 being the lowest in compliance and 4 the highest. However, some models cannot be adequately assessed since the creators did not reveal enough information. For example, data sources are often not disclosed for closed models.
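To illustrate how such a rubric rolls up into a single total, here is a minimal sketch in Python, assuming per-requirement grades on the scale described above. The requirement names follow the article; the grading function and example grades are invented for illustration and are not the Stanford team's actual scoring code.

```python
# Illustrative sketch only: totals grades across the 12 requirements the
# Stanford team evaluated, each on the 1-4 scale described in the article.
REQUIREMENTS = [
    "data sources", "data governance", "copyrighted data", "compute",
    "energy", "capabilities/limitations", "risks/mitigations", "evaluations",
    "testing", "machine-generated content", "member states",
    "downstream documentation",
]

MAX_PER_REQUIREMENT = 4
MAX_TOTAL = MAX_PER_REQUIREMENT * len(REQUIREMENTS)  # 48, matching the article


def total_score(grades: dict[str, int]) -> int:
    """Sum per-requirement grades into a single compliance score."""
    for name, grade in grades.items():
        assert name in REQUIREMENTS, f"unknown requirement: {name}"
        assert 1 <= grade <= MAX_PER_REQUIREMENT, f"grade out of range: {grade}"
    return sum(grades.values())


# Hypothetical example: a model graded 3 on every requirement totals 36 of 48.
example_grades = {name: 3 for name in REQUIREMENTS}
print(total_score(example_grades), "/", MAX_TOTAL)
```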
The researchers said they chose the EU AI Act because it is the “most important regulatory initiative on AI in the world today.” It will soon be law for the bloc’s 450 million people and is expected to set a precedent for AI regulation around the world, a phenomenon known as the ‘Brussels effect.’
“Policymakers across the globe are already drawing inspiration from the AI Act, and multinational companies may change their global practices to maintain a single AI development process,” the team wrote.
Feasible to comply
However, even the models that scored the highest still have room for “significant improvement,” the researchers concluded, which suggests the EU AI Act would drive “significant change” and “substantial progress” in transparency and accountability.
Four areas where most models struggled were copyrighted data (unclear liability issues), compute/energy (uneven reporting of energy use), risk mitigation (inadequate disclosures) and evaluation/testing (providers rarely measure performance in terms of intentional harms).
Generally, broadly open models are strong on resource disclosures but weaker on monitoring or controlling deployment. Closed or restricted models have the opposite issue.
As such, Stanford researchers are calling on EU policymakers to strengthen deployment requirements to ensure greater accountability.
The good news is that it is feasible for many model providers to raise their compliance scores to the high 30s or 40s. “We conclude that enforcing these 12 requirements in the Act would bring substantive change while remaining within reach for providers,” the researchers wrote.
The researchers said the requirements on foundation models, only added to the bill in May, bolster transparency throughout the AI ecosystem.
“We see no significant barriers that would prevent every provider from improving how it discusses limitations and risks as well as reporting on standard benchmarks,” according to the report. “Although open-sourcing may make aspects of deployment disclosure challenging, feasible improvements in disclosure of machine-generated content or availability of downstream documentation abound.”
Stanford recommends policy changes
The EU AI Act is just one step away from becoming law. After clearing a major hurdle with a Parliament vote in June, the bill now needs only sign-off from EU leaders on a definitive version before it passes into law.
The researchers said the EU AI Act should clarify areas that remain under-specified, such as which dimensions of performance must be disclosed. Model accuracy, robustness, fairness and efficiency should also be considered when assessing compliance, similar to the U.S. National Institute of Standards and Technology’s AI Risk Management Framework.
The EU AI Act should also require providers to disclose usage patterns, mirroring transparency reporting for online platforms, the Stanford team said. “We need to understand how foundation models are used (such as providing medical advice or preparing legal documents) to hold their providers to account.” This stipulation should only apply to the “most influential model providers,” they added.
Global policymakers, the researchers said, should prioritize transparency of the models. The experience with social media regulation offers a clear lesson: inadequate platform transparency has led to many harms.
The area where model providers achieved the worst compliance was in the disclosure of copyrighted training data. The researchers called for legislators to clarify how copyright relates to the training and output of generative models, including the conditions under which machine-generated content infringes on the rights of content creators.
For model providers, Stanford researchers recommend they start by working on “low-hanging fruit” such as improving documentation for downstream developers that build on foundation models. They also should partner with academia and the public to develop industry standards to aid transparency and accountability in the overall ecosystem.