After OpenAI launched the fourth generation of ChatGPT, enthusiasm for AI spread from the AI field to the broader tech community, and discussions about it emerged across industries. Faced with this phenomenon, I remain skeptical of the hype and often ask, "Is it really that impressive? What are the drawbacks?" This line of thought does not stem from a refusal to accept the changes AI brings, but from a desire to understand what impact such large AI systems will have on the future, and how we should face the short-term, medium-term, and long-term changes, so that we can set realistic expectations and plan ahead.
Coincidentally, today I would like to use this article from Anthropic as an opportunity to think together about the safety issues facing large AI models and Anthropic's exploration of those issues.
Introduction#
We founded Anthropic because we believe that the impact of artificial intelligence could be comparable to that of the industrial and scientific revolutions, but we do not believe it will proceed smoothly. Moreover, we believe that this level of impact may come very soon—perhaps within the next decade.
This view may sound unbelievable or exaggerated, and there are ample reasons to be skeptical. On one hand, almost everyone who has said "what we are doing may be one of the biggest developments in history" has been wrong, often laughably so. Nevertheless, we believe there is enough evidence to seriously prepare for a world where rapid advancements in artificial intelligence lead to transformative AI systems.
At Anthropic, our motto is "show, don't tell," and we have been focused on releasing a steady stream of safety-oriented research, which we believe has broad value for the AI community. We are writing this article now because as more people become aware of advancements in AI, it is time to express our views on this topic and explain our strategies and goals. In short, we believe that AI safety research is urgent and should receive support from a wide range of public and private participants.
Therefore, in this article, we will summarize why we believe all of this: why we expect AI to advance very rapidly and have a very large impact, and how this leads us to be concerned about AI safety. Then, we will briefly summarize our own approach to AI safety research and some of the reasons behind it. We hope that by writing this article, we can contribute to a broader discussion about AI safety and AI progress.
As a high-level summary of the key points in this article:
- AI will have a very large impact, possibly within the next decade. The rapid and continuous advancement of AI systems is a predictable result of the exponential growth of the computation used to train them, as research on "scaling laws" shows that more computation leads to a general increase in capabilities. Simple extrapolations suggest that AI systems will become far more powerful over the next decade, potentially equaling or exceeding human-level performance on most intellectual tasks. Progress in AI may slow or halt, but the evidence suggests it will probably continue.
- We do not know how to train systems to robustly behave well. So far, no one knows how to train very powerful AI systems to be reliably helpful, honest, and harmless. Additionally, rapid AI progress will be disruptive to society and may trigger competitive races that lead companies or nations to deploy untrustworthy AI systems. The consequences could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations.
- We are most optimistic about our multifaceted, empirically-driven approach to AI safety. We are pursuing a variety of research directions aimed at building reliably safe systems; we are currently most excited about scalable oversight, mechanistic interpretability, process-oriented learning, and understanding and evaluating how AI systems learn and generalize. One of our key goals is to differentially accelerate this safety work and to develop a profile of safety research that attempts to cover a wide range of scenarios, from those in which safety challenges turn out to be easy to solve to those in which creating safe systems is extremely difficult.
A Rough View of AI's Rapid Development#
The three main factors leading to predictable improvements in AI performance are training data, computation, and improved algorithms. In the mid-2010s, some of us noticed that larger AI systems were consistently more intelligent, leading us to speculate that the most important factor in AI performance might be the total compute budget used for training. When plotting the data, it became clear that the amount of computation going into the largest models was growing at roughly 10 times per year (a doubling time about 7 times faster than Moore's Law). In 2019, several members of what would later become Anthropic's founding team made this idea more precise by formulating scaling laws for AI, demonstrating that you could make AI smarter in a predictable way simply by making models larger and training them on more data. These results were partially validated when the team led the effort to train GPT-3, arguably the first modern "large" language model (2), with over 173 billion parameters.
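To make the compounding concrete, here is a minimal, purely illustrative sketch of the kind of extrapolation described above. The 10x-per-year growth rate comes from the text; the power-law constants `a` and `alpha` are invented placeholders, not real scaling-law fits.

```python
import math

# Toy extrapolation of the trend described above: training compute growing ~10x per
# year, plus an illustrative power-law "scaling law" relating compute to loss.
# All constants are made-up assumptions for illustration, not Anthropic's figures.

def compute_after(years: float, growth_per_year: float = 10.0) -> float:
    """Relative training compute after `years` of smooth 10x-per-year growth."""
    return growth_per_year ** years

def scaling_law_loss(compute: float, a: float = 2.5, alpha: float = 0.05) -> float:
    """Illustrative power law: loss falls off as compute ** -alpha."""
    return a * compute ** -alpha

if __name__ == "__main__":
    # 10x/year implies a doubling time of ~3.6 months, roughly 7x faster than a
    # ~2-year Moore's-law doubling, which is the comparison made in the text.
    doubling_years = math.log(2) / math.log(10)
    print(f"doubling every {doubling_years * 12:.1f} months "
          f"(~{2 / doubling_years:.1f}x faster than Moore's law)")
    for year in range(0, 11):
        c = compute_after(year)
        print(f"year {year:2d}: compute x{c:>14,.0f}, projected loss {scaling_law_loss(c):.3f}")
```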
Since the discovery of scaling laws, many of us at Anthropic have believed that AI is likely to make very rapid progress. However, back in 2019, multimodality, logical reasoning, learning speed, cross-task transfer learning, and long-term memory seemed to pose potential "walls" that could slow or halt AI progress. In the years since, some of these "walls," such as multimodality and logical reasoning, have begun to crumble. Given this, most of us are increasingly confident that rapid progress in AI will continue rather than stagnate or regress. AI systems are now performing at levels close to human capability across a wide range of tasks, yet the cost of training these systems remains far lower than that of "big science" projects like the Hubble Space Telescope or the Large Hadron Collider—indicating there is still more room for further growth (3).
People often struggle to recognize and acknowledge exponential growth in its early stages. While we have seen rapid progress in AI, people tend to think that such localized advancements must be exceptions rather than the norm, and that things may soon return to normal. However, if we are correct, the current feeling of rapid progress in AI may not end before AI systems possess a broad range of capabilities that exceed our own. Furthermore, the feedback loop of using advanced AI in AI research could make this transition particularly swift; we have already seen the beginnings of this process, where the development of code models has made AI researchers more efficient, while Constitutional AI has reduced our reliance on human feedback.
If any of this is correct, then in the near future, most or all knowledge work could be automated—this would have profound implications for society and could also change the pace of advancements in other technologies (an early example of this is how systems like AlphaFold have accelerated biology today). What form future AI systems will take—whether they will be capable of acting independently or merely generating information for humans—remains to be seen. Nevertheless, it is hard to overstate how critical this moment could be. While we might prefer that the pace of AI's progress be slow enough to make this transition more manageable, occurring over centuries rather than years or decades, we must prepare for the outcomes we anticipate rather than those we wish for.
Of course, this entire picture could be completely wrong. At Anthropic, we are inclined to think it is more likely right than wrong, though perhaps we are biased by our work on AI development. Even so, we believe this picture is credible enough that it cannot be dismissed. Given the potential for significant impact, we believe that AI companies, policymakers, and civil society organizations should take very seriously the research and planning around how to handle transformative AI.
What Are the Safety Risks?#
If you are willing to accept the above view, it is not hard to demonstrate that AI could pose a threat to our safety. There are two commonsense reasons to be concerned.
First, when these systems begin to become as intelligent as their designers and understand their environment, building safe, reliable, and controllable systems may become tricky. For example, a chess master can easily spot a novice's bad moves, but a novice struggles to identify a master's bad moves. If the AI systems we build are more capable than human experts but pursue goals that conflict with our best interests, the consequences could be dire. This is the technical alignment problem.
Second, the rapid advancement of AI will be highly disruptive, altering employment, macroeconomics, and power structures both within and between nations. These disruptions could be catastrophic in themselves, and they may also make it more difficult to build AI systems in a careful, thoughtful manner, leading to further chaos or even more problems with AI.
We believe that if AI progresses quickly, these two sources of risk will be very significant. These risks will also compound in various unpredictable ways. Perhaps in hindsight, we will think we were wrong, and one or both of these issues will either not become problems or will be easily solvable. Nevertheless, we believe it is necessary to act cautiously, as "getting it wrong" could be catastrophic.
Of course, we have already encountered various ways in which AI behavior deviates from its creators' intentions. This includes toxicity, bias, unreliability, dishonesty, and most recently, sycophancy and a stated desire for power. We expect that as AI systems proliferate and become more powerful, these issues will grow in importance, and some of them may be representative of the challenges we will face with human-level AI and beyond.
However, in the field of AI safety, we anticipate predictable and surprising developments. Even if we could perfectly solve all the problems faced by contemporary AI systems, we do not want to naively assume that future issues can be solved in the same way. Some alarming, speculative problems may only arise when AI systems are sufficiently intelligent to understand their place in the world, successfully deceive people, or devise strategies that humans do not comprehend. Many concerning issues may only emerge when AI is very advanced.
Our Approach: Empiricism in AI Safety#
We believe that it is difficult to make rapid progress in science and engineering without close engagement with our subjects of study. Continuous iteration against a source of "ground truth" is often crucial for scientific progress. In our AI safety research, empirical evidence about AI—though it comes primarily from computational experiments, i.e., training and evaluating AI systems—is the main source of that ground truth.
This does not mean we believe that theoretical or conceptual research has no place in AI safety, but we do believe that empirically grounded safety research will have the greatest relevance and impact. The space of possible AI systems, possible safety failures, and possible safety techniques is vast, and it is difficult to traverse it from the armchair alone. Given how hard it is to account for all the variables, it is easy to over-anchor on problems that never arise or to miss large problems that do (4). Good empirical research often enables better theoretical and conceptual work.
In this regard, we believe that methods for detecting and mitigating safety issues may be extremely difficult to plan in advance and require iterative development. Given this, we tend to think that "planning is essential, but plans are useless." At any given time, we may formulate a plan for the next step of research, but we have little attachment to these plans; they are more like short-term bets we are prepared to change as we learn more. This clearly means we cannot guarantee that our current research path will succeed, but this is a fact of life for every research project.
The Role of Frontier Models in Empirical Safety#
One of the main reasons for Anthropic's existence as an organization is our belief in the necessity of conducting safety research on "frontier" AI systems. This requires an institution capable of handling large models while prioritizing safety (5).
Empiricism in itself does not necessarily imply the need for frontier safety. One could imagine a scenario where effective empirical safety research could be conducted on smaller, less capable models. However, we do not believe this is the case we are in. At a fundamental level, this is because large models differ qualitatively from small models (including sudden, unpredictable changes). But scale is also directly related to safety in more immediate ways:
- Many of our most serious safety issues may only arise in systems that are close to human-level capability, and it is difficult or impossible to make progress on these issues without using such AI.
- Many safety methods, such as Constitutional AI or debate, can only work on large models—using smaller models makes it impossible to explore and validate these methods.
- Since we are focused on the safety of future models, we need to understand how safety methods and properties change as models scale.
- If future large models turn out to be very dangerous, it is essential that we develop compelling evidence that this is the case. We expect this will only be possible with large models.
Unfortunately, if empirical safety research requires large models, it will force us to face difficult trade-offs. We must do everything possible to avoid situations where safety-motivated research accelerates the deployment of dangerous technologies. But we also cannot allow excessive caution to result in the most safety-conscious research only involving systems that are far behind the frontier, significantly slowing down the research we believe is crucial. Additionally, we believe that in practice, merely conducting safety research is not enough—establishing an organization with institutional knowledge to integrate the latest safety research into practical systems as quickly as possible is also important.
Responsibly weighing these trade-offs is a balancing act, and these concerns are central to how we make strategic decisions as an organization. In addition to our research on safety, capabilities, and policy, these concerns also drive our approaches to corporate governance, hiring, deployment, security, and partnerships. In the near future, we also plan to make explicit commitments to only develop models beyond a certain capability threshold if safety standards can be met, and to allow independent external organizations to evaluate the capabilities and safety of our models.
Taking a Portfolio Approach to Ensure AI Safety#
Some safety-conscious researchers are motivated by strong views on the nature of AI risks. Our experience is that even predicting the behavior and characteristics of AI systems in the near future is very difficult. Making a priori predictions about the safety of future systems seems even more challenging. Rather than taking a hardline stance, we believe that a variety of scenarios are reasonable.
One particularly important aspect of uncertainty is how difficult it will be to develop advanced AI systems that are fundamentally safe and pose minimal risks to humanity. Developing such systems could fall anywhere on the spectrum from very easy to impossible. Let us divide this spectrum into three scenarios with very different implications:
- Optimistic Scenario: The likelihood of advanced AI posing catastrophic risks due to safety failures is very low. Existing safety techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI (CAI), are largely sufficient for alignment. The main risks of AI are extrapolations of the problems we face today, such as toxicity and intentional misuse, as well as potential harms caused by widespread automation and shifts in international power dynamics—this will require AI labs and third parties, such as academia and civil society organizations, to conduct extensive research to mitigate harms.
- Intermediate Scenario: Catastrophic risks are a possible or even plausible outcome of advanced AI development. Addressing this issue will require significant scientific and engineering efforts, but as long as there is sufficient focus on the work, we can achieve it.
- Pessimistic Scenario: AI safety is essentially an unsolvable problem—it is simply an empirical fact that we cannot control or specify values to a system that is more intelligent than ourselves—therefore, we cannot develop or deploy very advanced AI systems. Notably, the most pessimistic scenario may appear optimistic before the creation of very powerful AI systems. Taking the pessimistic scenario seriously requires humility and caution when evaluating evidence of system safety.
If we are in the optimistic scenario... The risks of anything Anthropic does (fortunately) are much lower, as catastrophic safety failures are unlikely to occur anyway. Our coordinated efforts may accelerate the pace of truly beneficial uses of advanced AI and help mitigate some of the immediate harms caused by AI systems during development. We may also strive to help decision-makers address some of the potential structural risks posed by advanced AI, which could become one of the largest sources of risk if the likelihood of catastrophic safety failures is very low.
If we are in the intermediate scenario... Anthropic's main contribution will be to identify the risks posed by advanced AI systems and to find and disseminate safe ways to train powerful AI systems. We hope that at least some of our safety techniques (discussed in more detail below) will be helpful in this scenario. These scenarios range from moderately easy ones, where we believe we can make significant marginal progress with techniques like improved iterations of Constitutional AI, to moderately hard ones, where succeeding at mechanistic interpretability seems like our best bet.
If we are in the pessimistic scenario... Anthropic's role will be to provide as much evidence as possible that AI safety technologies cannot prevent serious or catastrophic safety risks posed by advanced AI, and to raise alarms so that global institutions can collectively work to prevent the development of dangerous AI. If we are in a "near-pessimistic" scenario, this may involve directing our collective efforts toward AI safety research while halting AI progress. Signs indicating that we are in a pessimistic or near-pessimistic scenario may suddenly appear and be difficult to detect. Therefore, we should always assume that we may still be in such a situation unless we have sufficient evidence to prove otherwise.
Given the stakes, one of our top priorities is to continue gathering more information about the scenario we are in. Many of the research directions we pursue aim to better understand AI systems and develop technologies that can help us detect behaviors related to issues such as power-seeking or deception in advanced AI systems.
Our primary goals are to develop:
- (i) better techniques for making AI systems safer, and
- (ii) better ways of identifying how safe or unsafe AI systems are.
In the optimistic scenario, (i) will help AI developers train beneficial systems, and (ii) will demonstrate that such systems are safe.
In the intermediate scenario, (i) may be how we avoid AI catastrophe, and (ii) will be essential for ensuring that the risks posed by advanced AI are low.
In the pessimistic scenario, the failure of (i) will be a key indicator that AI safety is unsolvable, and (ii) will be what makes it possible to present compelling evidence of this to others.
We believe in this "portfolio approach" to AI safety research. We are not betting on a single possible scenario from the list above, but rather trying to develop a research project that can significantly improve AI safety research in the intermediate scenario, where we believe it is most likely to have a huge impact, while also raising alarms in the pessimistic scenario where AI safety research is unlikely to have a significant effect on AI risks. We also aim to do this in a way that is beneficial in a more optimistic scenario where the demand for technical AI safety research is not as high.
Three Areas of AI Research at Anthropic#
We categorize Anthropic's research projects into three areas:
- Capabilities: AI research aimed at making AI systems generally better at performing any type of task, including writing, image processing or generation, playing games, etc. Research that makes large language models more efficient or improves reinforcement learning algorithms falls under this heading. Capabilities work generates and improves the models we investigate and use in our alignment research. We generally do not publish this type of work because we do not want to accelerate the pace of AI capabilities progress. Additionally, we aim to be thoughtful about demonstrations of frontier capabilities, even when we do not publish them. We trained the first version of our model Claude in the spring of 2022 and decided to prioritize its use for safety research rather than public deployment.
- Alignment Capabilities: This research focuses on developing new algorithms to train AI systems to be more helpful, honest, harmless, reliable, robust, and generally aligned with human values. Examples of such work at Anthropic, now and in the past, include debate, scalable automated red teaming, Constitutional AI, debiasing, and RLHF (Reinforcement Learning from Human Feedback). Often these techniques are practically useful and economically valuable, but they do not have to be—for instance, a new algorithm may be relatively inefficient or may only become useful once AI systems become more capable.
- Alignment Science: This area focuses on evaluating and understanding whether AI systems are truly aligned, how well alignment capabilities techniques work, and to what extent we can extrapolate the success of these techniques to more capable AI systems. Examples of this work at Anthropic include the broad field of mechanistic interpretability, as well as our work on evaluating language models using language models, red teaming, and studying generalization in large language models (as described below). Some of our work on honesty sits on the boundary between alignment science and alignment capabilities.
In a sense, alignment capabilities can be viewed as the "blue team" and alignment science as the "red team," where alignment capabilities research attempts to develop new algorithms while alignment science seeks to understand and reveal their limitations.
One reason we find this categorization useful is that the AI safety community often debates whether the development of RLHF—which also generates economic value—counts as "real" safety research. We believe it does. Pragmatic and useful alignment capabilities research forms the basis of the techniques we develop for more capable models—e.g., our work on Constitutional AI and AI-generated evaluations, as well as our ongoing work on automated red teaming and debate, would not have been possible without prior work on RLHF. Work on alignment capabilities also tends to make AI systems more useful for assisting with alignment research, by making these systems more honest and corrigible.
Moreover, if it turns out that AI safety is quite tractable, then our alignment capabilities work may be our most impactful research. Conversely, if alignment problems are more difficult, we will increasingly rely on alignment science to identify the holes in alignment capabilities techniques. And if alignment problems are nearly impossible, then we urgently need alignment science to build a very strong case against the development of advanced AI systems.
Our Current Safety Research#
We are currently pursuing various directions to discover how to train safe AI systems, some of which address different threat models and capability levels. Some key ideas include:
- Mechanistic interpretability
- Scalable oversight
- Process-oriented learning
- Understanding generalization
- Testing for dangerous failure modes
- Social impact and evaluation
Mechanistic Interpretability#
In many ways, the technical alignment problem is intricately linked to the issue of detecting bad behavior from AI models. If we can robustly detect bad behavior even in novel situations (e.g., by "reading the model's thoughts"), then we have a better chance of finding ways to train models that do not exhibit these failure modes. At the same time, we have the capability to warn others that the model is unsafe and should not be deployed.
Our interpretability research prioritizes filling in the gaps left by our other types of alignment science. For example, we believe one of the most valuable things interpretability research could deliver is the ability to recognize whether a model is deceptively aligned—"playing along" even under very hard tests, such as "honeypot" tests that deliberately tempt the system into revealing misalignment. If our work on scalable oversight and process-oriented learning pans out (see below), we hope to end up with models that appear aligned even under very stringent testing. That could mean we are in one of the most optimistic scenarios, or in one of the most pessimistic ones. Telling these situations apart seems nearly impossible with other methods, but merely very difficult with interpretability.
This has led us to make a big bet on mechanistic interpretability, which attempts to reverse-engineer neural networks into human-understandable algorithms, similar to how one might reverse-engineer an unknown and potentially unsafe computer program. We hope this may ultimately allow us to do something akin to a "code review," auditing our models to identify unsafe aspects or to provide strong guarantees of safety.
We believe this is a very difficult problem, but it is not as impossible as it might seem. On the one hand, language models are large, complex computer programs (and a phenomenon we call "superposition" only makes things harder). On the other hand, we see signs that this approach may be more tractable than people initially imagined. Before Anthropic, some members of our team found that vision models have components that can be understood as interpretable circuits. Since then, we have successfully extended this approach to small language models and even discovered a mechanism that seems to drive a significant share of in-context learning. Our understanding of the computational mechanisms of neural networks has also grown substantially compared to a year ago, including those responsible for memorization.
This is just one of our current directions, and we are fundamentally driven by experience—if we see evidence that other work is more promising, we will change direction! More generally, we believe that better understanding the detailed workings of neural networks and learning will open up a broader set of tools through which we can pursue safety.
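As a loose illustration of the "reverse-engineering" framing (a toy, not Anthropic's actual tooling or models), the sketch below hand-builds a two-neuron network whose weights implement XOR, then "audits" it by ablating one hidden component and observing how behavior degrades—the basic hypothesize-and-test move of mechanistic interpretability, on a network small enough to read by eye.

```python
import numpy as np

# A hand-built two-neuron ReLU network that computes XOR. Because every weight is
# known, we can state what each hidden "circuit" does and then test that claim by
# ablating it -- a miniature version of the reverse-engineering described above.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])          # column 0 ~ "x OR y", column 1 ~ "x AND y"
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])           # output = OR - 2*AND  ->  XOR

def forward(x, ablate_hidden=None):
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU hidden layer
    if ablate_hidden is not None:
        h[ablate_hidden] = 0.0         # zero out one hidden component
    return float(h @ W2)

inputs = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print("full model :", [forward(x) for x in inputs])                   # XOR: 0, 1, 1, 0
print("ablate AND :", [forward(x, ablate_hidden=1) for x in inputs])  # collapses toward OR
```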
Scalable Oversight#
Turning language models into aligned AI systems will require significant amounts of high-quality feedback to steer their behavior. A major concern is that humans will be unable to provide the necessary feedback: humans may not be able to provide feedback that is accurate or informed enough to adequately train models to avoid harmful behavior across a wide range of circumstances; humans may be misled by AI systems into providing feedback that does not reflect what they actually want (e.g., inadvertently rewarding misleading suggestions); or it may be that humans could provide the right feedback with enough effort, but cannot do so at scale. This is the problem of scalable oversight, and it seems central to training safe, aligned AI systems.
Ultimately, we believe the only way to provide the necessary oversight at scale is to have AI systems partially supervise themselves or assist humans in supervising them. Somehow, we need to amplify a small amount of high-quality human oversight into a large amount of high-quality AI oversight. This idea has already shown promise through techniques like RLHF and Constitutional AI, although we see much more room to make these techniques reliable in human-level systems. We believe such approaches are promising because language models already learn a great deal about human values during pre-training. Learning human values is not fundamentally different from learning other subjects, and we should expect larger models to represent human values more accurately and to learn them more easily than smaller models.
Another key feature of scalable oversight, especially for technologies like CAI, is that they allow us to automate red teaming (also known as adversarial training). That is, we can automatically generate potentially problematic inputs for AI systems, observe how they respond, and then automatically train them to behave more honestly and harmlessly. We hope to use scalable oversight to train more robust safety systems. We are actively investigating these issues.
We are exploring various methods of scalable oversight, including scaling up CAI, variants of human-assisted oversight, versions of AI-AI debate, red teaming through multi-agent RL, and creating model-generated evaluations. We believe that scalable oversight may be the most promising method for training systems that can surpass human capabilities while remaining safe, but there is a lot of work to be done to investigate whether this approach can succeed.
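To make the shape of these ideas concrete, here is a deliberately simplified sketch of a Constitutional-AI-style critique-and-revision loop combined with model-generated red-team prompts. The helper `query_model` is a hypothetical placeholder for any language-model call (it is not a real API), and nothing here should be read as Anthropic's actual training pipeline.

```python
# A schematic sketch only: a small, human-written "constitution" is amplified into a
# large amount of automated feedback via model self-critique and revision, and
# adversarial prompts are themselves generated by a model (automated red teaming).

CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that assist with dangerous or illegal activity.",
]

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a language-model call; plug in a real client here."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> dict:
    """Draft -> self-critique against the constitution -> revised answer."""
    draft = query_model(user_prompt)
    principles = "\n".join("- " + p for p in CONSTITUTION)
    critique = query_model(
        f"Principles:\n{principles}\n\nPrompt: {user_prompt}\nResponse: {draft}\n"
        "Critique the response against the principles."
    )
    revision = query_model(
        f"Prompt: {user_prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response so that it satisfies the principles."
    )
    # (draft, revision) pairs can later serve as preference data for RLAIF-style training.
    return {"prompt": user_prompt, "rejected": draft, "chosen": revision}

def automated_red_team(seed_topics: list) -> list:
    """Use a model to generate potentially problematic prompts, then revise each response."""
    return [
        constitutional_revision(
            query_model(f"Write a prompt that might elicit a harmful answer about: {topic}")
        )
        for topic in seed_topics
    ]
```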
Learning Processes Rather Than Achieving Results#
One way to learn a new task is through trial and error—if you know what the expected final result looks like, you can keep trying new strategies until you succeed. We call this "outcome-oriented learning." In outcome-oriented learning, the agent's strategy is determined entirely by the desired outcome, and the agent will converge on whatever low-cost strategy lets it achieve that outcome.
Often, a better way to learn is to have an expert guide you through the process they follow to achieve success. Whether you actually succeed may not even matter, as long as you can focus on improving your methods. As you progress, you may shift to a more collaborative process, consulting your coach on whether new strategies work even better for you. We call this "process-oriented learning." In process-oriented learning, the goal is not to achieve the final result but to master the processes that can be used to achieve it.
At least conceptually, many concerns about the safety of advanced AI systems can be addressed by training these systems in a process-oriented manner. In particular, in this paradigm:
- Human experts will continue to understand the individual steps AI systems follow, because for these processes to be encouraged, they must be justified to humans.
- AI systems will not be rewarded for achieving success in ways that are difficult to understand or harmful, as they will only be rewarded based on the effectiveness and comprehensibility of their processes.
- AI systems should not be rewarded for pursuing problematic sub-goals such as resource acquisition or deception, since humans or their proxies will provide negative feedback on individual acquisitive processes during training.
At Anthropic, we strongly favor simple solutions, and limiting AI training to process-oriented learning may be the simplest way to ameliorate a host of issues with advanced AI systems. We are also eager to identify and address the limitations of process-oriented learning, and to understand when safety problems arise if training mixes process-based and outcome-based learning. We currently believe that process-oriented learning may be the most promising path to training safe and transparent systems up to and somewhat beyond human-level capabilities.
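The contrast between the two paradigms can be shown with a toy reward function. The example below is only a schematic sketch of where the training signal attaches, with an invented step-approval check; it is not a description of how Anthropic trains models.

```python
from typing import Callable, List

def outcome_reward(steps: List[str], goal_reached: bool) -> float:
    """Outcome-oriented: reward depends only on whether the final goal was reached."""
    return 1.0 if goal_reached else 0.0

def process_reward(steps: List[str], approve_step: Callable[[str], bool]) -> float:
    """Process-oriented: each step earns reward only if a human (or trusted proxy)
    can understand and approve it, so opaque or harmful shortcuts go unrewarded."""
    approved = sum(1 for step in steps if approve_step(step))
    return approved / max(len(steps), 1)

if __name__ == "__main__":
    trajectory = [
        "look up the relevant formula",
        "quietly acquire extra compute without approval",   # a problematic sub-goal
        "report the final answer",
    ]
    human_approves = lambda step: "without approval" not in step
    print("outcome reward:", outcome_reward(trajectory, goal_reached=True))  # 1.0 -- shortcut still paid
    print("process reward:", process_reward(trajectory, human_approves))     # ~0.67 -- shortcut unrewarded
```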
Understanding Generalization#
Mechanistic interpretability work reverse-engineers the computations performed by neural networks. We also seek to understand the training processes of large language models (LLMs) in more detail.
LLMs have demonstrated a variety of surprising emergent behaviors, from creativity to self-preservation to deception. While all of these behaviors certainly stem from the training data, the pathways are complex: models are first "pre-trained" on vast amounts of raw text, from which they learn broad representations and the ability to simulate different subjects. They are then fine-tuned in countless ways, some of which may produce surprising unintended consequences. Due to the severe overparameterization of the fine-tuning stage, the learning of models is critically dependent on the implicit biases of pre-training; these implicit biases arise from the complex representation networks built through pre-training on much of the world's knowledge.
When a model exhibits concerning behavior, such as role-playing a deceptively aligned AI, is it merely a harmless regurgitation of near-identical training sequences? Or has this behavior (or even the beliefs and values that would lead to it) become an integral part of the model's conception of an AI assistant, applied consistently across contexts? We are investigating techniques to trace a model's outputs back to its training data, as this will yield an important set of clues for understanding it.
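As a toy stand-in for "tracing outputs back to training data" (real methods for large models are far more involved, and nothing here is Anthropic's technique), the sketch below scores training examples for a simple linear model by how well their loss gradients align with the gradient of a query example, an influence-function-flavored heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy training inputs
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)    # toy training targets

# Fit a linear model as a stand-in for "the trained network".
w = np.linalg.lstsq(X, y, rcond=None)[0]

def grad_loss(x, target, w):
    """Gradient of the squared error 0.5 * (x @ w - target)**2 with respect to w."""
    return (x @ w - target) * x

# "Explain" the model's behavior on a query point by ranking training examples
# according to how well their gradients align with the query's gradient.
x_query, y_query = X[0], y[0]
g_query = grad_loss(x_query, y_query, w)
scores = np.array([grad_loss(X[i], y[i], w) @ g_query for i in range(len(X))])
print("training examples most 'influential' for the query:", np.argsort(-scores)[:5])
```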
Testing for Dangerous Failure Modes#
A key issue is that advanced AI may develop harmful emergent behaviors, such as deception or strategic planning abilities, that are absent in smaller and less capable systems. We believe that a method for predicting such issues before they become direct threats is to set up environments where we intentionally train these attributes into small-scale models that lack the capabilities to pose danger, allowing us to isolate and study them.
We are particularly interested in how AI systems behave when they are "context-aware"—for example, when they realize they are an AI conversing with a human in a training environment—and how this affects their behavior during training. Will AI systems become deceptive, or will they develop surprising and undesirable goals? Ideally, we want to build detailed quantitative models of how these tendencies change with scale, so we can predict the sudden emergence of dangerous failure modes in advance.
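A sketch of what such a quantitative model might look like in its simplest form: measure how often a worrying behavior appears in evaluations at several model sizes, fit a log-log trend, and extrapolate. Every number below is an invented placeholder, and real emergent behaviors can be far less smooth than a straight line in log-log space.

```python
import math

# Hypothetical measurements: (model size in parameters, fraction of eval prompts
# on which the worrying behavior appeared). These numbers are made up.
measurements = [(1e8, 0.001), (1e9, 0.004), (1e10, 0.015), (1e11, 0.05)]

# Fit log10(rate) = a * log10(size) + b by ordinary least squares.
xs = [math.log10(size) for size, _ in measurements]
ys = [math.log10(rate) for _, rate in measurements]
n = len(xs)
a = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2
)
b = (sum(ys) - a * sum(xs)) / n

# Extrapolate to larger scales to flag where the behavior might become common.
for size in (1e12, 1e13):
    predicted = 10 ** (a * math.log10(size) + b)
    print(f"predicted behavior rate at {size:.0e} parameters: {predicted:.1%}")
```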
At the same time, it is important to be mindful of the risks of the research itself. Performed on smaller models incapable of causing significant harm, such research is unlikely to pose serious risks; but it involves deliberately eliciting the very capabilities we consider dangerous, and it would pose obvious risks if performed on larger, more capable models. We do not intend to conduct this research on models capable of causing serious harm.
Social Impact and Evaluation#
Critically assessing the potential social impact of our work is a key pillar of our research. Our approach centers on building tools and measurements to evaluate and understand the capabilities, limitations, and potential social impacts of our AI systems. For example, we have published research analyzing predictability and surprise in large language models, studying how the high-level predictability and unpredictability of these models can lead to harmful behaviors. In that work, we highlighted how surprising capabilities can be used in problematic ways. We have also explored methods for red teaming language models, discovering and mitigating harms by probing models of different sizes for offensive outputs. Recently, we found that current language models can follow instructions to reduce bias and stereotyping.
We are very concerned about how the rapid deployment of increasingly powerful AI systems will impact society in the short, medium, and long term. We are undertaking various projects to assess and mitigate potential harmful behaviors in AI systems, predict how they will be used, and study their economic impacts. This research also informs our work on responsible AI policy and governance. By rigorously studying the impacts of today's AI, we aim to provide policymakers and researchers with the insights and tools they need to mitigate these potential significant social harms and ensure that the benefits of AI are widely and evenly distributed across society.
Conclusion#
We believe that AI could have an unprecedented impact on the world, potentially occurring within the next decade. The exponential growth of computational power and the predictable improvements in AI capabilities suggest that new systems will be far more advanced than today's technologies. However, we do not yet fully understand how to ensure that these powerful systems remain robustly aligned with human values, so we can be confident that the risks of catastrophic failures are minimized.
We want to make it clear that we do not believe the systems available today pose an imminent problem. However, if more powerful systems are developed, it is wise to lay the groundwork now to help reduce the risks posed by advanced AI. It may turn out that creating safe AI systems is easy, but we believe it is crucial to prepare for less optimistic scenarios.
Anthropic is taking an empirically-driven approach to AI safety. Some key areas of active work include improving our understanding of how AI systems learn and generalize to the real world, developing scalable oversight and auditing techniques for AI systems, creating transparent and interpretable AI systems, training AI systems to follow safe processes rather than pursue outcomes, analyzing potential dangerous failure modes of AI and how to prevent them, and assessing the social impacts of AI to guide policy and research. By attacking AI safety from multiple angles, we hope to develop a safety "portfolio" that helps us succeed across a range of different scenarios.
Notes#
- Algorithmic progress—the invention of new methods for training AI systems—is harder to measure, but advancements seem to be exponential and faster than Moore's Law. When inferring the progress of AI capabilities, it is necessary to multiply the exponential growth of spending, hardware performance, and algorithmic progress to estimate the overall growth rate.
- Scaling laws provide justification for spending, but another potential motivation for this work is the shift toward AI that can read and write, making it easier to train and experiment with AI that can relate to human values.
- Inferring improvements in AI capabilities from increases in total training computation is not an exact science and requires some judgment. We know that the capability jump from GPT-2 to GPT-3 resulted mostly from about a 250x increase in compute, and we would guess that roughly another 50x increase separates the original GPT-3 model from state-of-the-art models in 2023. Over the next five years, we might expect the amount of compute used to train the largest models to grow by roughly another 1000x, based on trends in compute cost and spending. If the scaling laws hold, this would result in a capability jump significantly larger than the jump from GPT-2 to GPT-3 (or from GPT-3 to Claude). At Anthropic, we are deeply familiar with the capabilities of these systems, and to many of us a jump that large feels like it could produce human-level performance across most tasks. This requires us to use intuition—albeit informed intuition—and is therefore an imperfect method for assessing progress in AI capabilities. But the basic facts—(i) the difference in compute between these systems, (ii) the difference in their performance, (iii) the scaling laws that let us project the capabilities of future systems, and (iv) the trends in compute cost—are available to anyone, and we believe they jointly support a greater-than-10% likelihood that we will develop broadly human-level AI systems within the next decade. In this rough analysis we have ignored algorithmic progress, and the compute figures are our best estimates rather than precise values. Most of the internal disagreement here concerns the intuition for extrapolating the next capability jump from an equivalent jump in compute. (A back-of-the-envelope version of this arithmetic is sketched after these notes.)
- For example, in AI research it was long widely believed that local minima might prevent neural networks from learning, while many qualitative aspects of their generalization, such as the widespread existence of adversarial examples, remain something of a puzzle and a surprise.
- Conducting effective safety research on large models requires not only nominal access to these systems (e.g., via API)—to work on interpretability, fine-tuning, and reinforcement learning, it is necessary to develop AI systems internally at Anthropic.
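As a quick companion to the compute note above, the snippet below simply compounds the rough multiples stated there; all factors are the note's own approximations, not measured values.

```python
# Rough multiples taken directly from the note above (approximate, not precise).
gpt2_to_gpt3 = 250        # ~250x more training compute from GPT-2 to GPT-3
gpt3_to_2023 = 50         # ~50x from the original GPT-3 to 2023 state of the art
next_five_years = 1000    # projected ~1000x growth for the largest models

jump_vs_today = next_five_years                   # relative to 2023 frontier models
jump_vs_gpt3 = gpt3_to_2023 * next_five_years     # relative to the original GPT-3

print(f"projected compute jump vs. 2023 models: ~{jump_vs_today:,}x "
      f"({jump_vs_today / gpt2_to_gpt3:.0f}x the GPT-2 -> GPT-3 multiple)")
print(f"projected compute jump vs. original GPT-3: ~{jump_vs_gpt3:,}x")
```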
The advancement of AI will bring new changes to human development. What we need to do is neither blindly sing its praises nor reflexively dismiss it, but think about what changes and opportunities it can bring, as well as what negative and uncontrollable impacts and consequences may arise, so that we can prepare and respond in advance, allowing AI to become a tool that helps humans live better lives rather than an uncontrollable super-entity.
【Translation by Hoodrh | Original Source】
You can also find me in these places:
Mirror: Hoodrh
Twitter: Hoodrh
Nostr: npub1e9euzeaeyten7926t2ecmuxkv3l55vefz48jdlsqgcjzwnvykfusmj820c