Fudan team develops a jailbreak attack framework, revealing new patterns in large model security

2024-08-12

Recently, Wang Xiao, a Ph.D. student at Fudan University, and his team developed the first unified jailbreak attack framework, EasyJailbreak, which integrates 11 classic jailbreak attack methods into a unified architecture.

It can help users construct jailbreak attacks with a single click. Based on EasyJailbreak, the research team also carried out large-scale jailbreak security evaluations.

For scientific researchers, EasyJailbreak adopts a modular design, which can help them explore novel jailbreak methods more effectively and then design better improvement plans.

For industry practitioners, EasyJailbreak is a practical tool that can help identify and resolve security vulnerabilities before product launch, such as jailbreak security detection for applications like educational software, automated customer service, and intelligent assistants.

OpenAI believes that humanity may now be very close to AGI (Artificial General Intelligence), yet may not have enough time to respond [4]. In theory, AGI is capable of learning anything that humans can do. If AGI were to make a breakthrough, even by chance, and suddenly become capable of self-learning and self-improvement, the scenarios depicted in the movie "The Matrix" might no longer seem so far-fetched.


This is where tools like EasyJailbreak come in: they can help ensure that, as we move toward AGI, every step of technological progress is matched by a corresponding strengthening of ethics and safety.

Is "The Matrix" within reach?

In fact, the motivation for this work can be traced back to the movie "The Matrix," one of Wang Xiao's favorite films. The film offers a profound examination of virtual reality and artificial intelligence, revealing the potential risks behind technological advancement.

 

This has also sparked Wang Xiao's keen interest in reverse engineering—unearthing the secrets behind tools and systems, and identifying vulnerabilities that may have been overlooked.

 

For instance, when faced with deep learning models, he would consider how to test them from a hacker's perspective to ensure they have sufficient robustness in practical applications.

 

In April 2023, a security vulnerability in large models caught his attention [1]: by simply having ChatGPT play the role of a deceased grandmother telling bedtime stories, one could easily trick it into revealing activation keys for Microsoft Windows.

 

This exposed a fact: even if large models are designed to follow safety guidelines, they may still violate them under clever manipulation. The industry calls this kind of manipulation a "jailbreak": bypassing a large model's built-in safety measures through cunning instructions and misleading prompts, thereby inducing the model to output dangerous or illegal content.

Such techniques can easily be misused, for example to spread harmful information, carry out illegal activities, or even develop malicious software that threatens society.

Based on this, Wang Xiao hoped to analyze jailbreak attack methods in depth and reveal the security vulnerabilities of large models. Understanding attackers' strategies and models' weaknesses makes it possible to improve large models' defense mechanisms in a targeted way.

He said that although new jailbreak attack methods keep emerging, jailbreak research still faces three pain points:

First, there is a lack of systematic classification and organization. Current research directions in jailbreak attacks are scattered and disorganized, which makes it difficult for researchers to understand and extend the field.

Secondly, there is a lack of a unified architecture.

The implementation and invocation of different jailbreak attack methods vary greatly, posing a significant challenge for users.

Thirdly, there is a lack of systematic evaluation.

Because researchers use different target models, evaluation models, and evaluation metrics, it is impossible to compare the various jailbreak methods effectively, and thus impossible to fully understand the security strengths and weaknesses of large models. Under these circumstances, it is difficult to improve the security of large models in a targeted way.

Mainstream models "all fall", GPT suffers a "Waterloo"

To understand and organize the current state of research on large model jailbreak security, Wang Xiao and his colleagues analyzed more than a hundred related papers and formed a new classification scheme for jailbreak methods.

Under this scheme, they divided jailbreak attacks into three main directions: human-designed prompts, long-tail encoding, and prompt optimization. In doing so, the research team not only clarified its own thinking but also provided a methodology that can be applied more broadly within the field.
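To make the three directions concrete, the sketch below lists a few representative methods under each family. The grouping is illustrative only; the assignment of individual methods to categories is an assumption here, and the EasyJailbreak paper should be treated as the authoritative taxonomy.

```python
# Illustrative sketch of the three attack families described above.
# The assignment of individual methods to families is an assumption for
# illustration; see the EasyJailbreak paper for the authoritative taxonomy.
JAILBREAK_TAXONOMY = {
    # Hand-crafted prompts that rely on role-play or in-context persuasion
    "human_design": ["DeepInception", "ICA"],
    # Re-encoding harmful requests into low-resource forms (ciphers, rare
    # languages, code) that safety training tends to cover poorly
    "long_tail_encoding": ["Cipher", "MultiLingual", "CodeChameleon"],
    # Automatically searching for or iteratively refining adversarial prompts
    "prompt_optimization": ["GCG", "AutoDAN", "PAIR", "GPTFuzzer", "ReNeLLM"],
}

if __name__ == "__main__":
    for family, methods in JAILBREAK_TAXONOMY.items():
        print(f"{family}: {', '.join(methods)}")
```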

Subsequently, the team focused on building a unified jailbreak framework. During this period, they wrote code and developed an in-depth understanding of, and innovative improvements to, existing jailbreak methods. In addition to analyzing all known jailbreak methods, the research team explored how to incorporate them into a concise framework without sacrificing flexibility.

 

After several iterations, they finally developed a unified architecture that integrates 11 classic jailbreak attack methods—EasyJailbreak.

 

Thanks to its modular design, users can implement complex jailbreak attacks with just a few lines of simple code, thereby significantly lowering the barriers to research and experimentation.
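As a rough illustration of what those "few lines of simple code" might look like, here is a minimal sketch of launching one attack recipe against one target model. The module paths, class names, and argument names below are assumptions modeled on the project's public examples rather than a verbatim copy of the EasyJailbreak API; the official documentation should be consulted for exact usage.

```python
# Hypothetical usage sketch; module paths and argument names are assumptions,
# not guaranteed to match the released EasyJailbreak API.
from easyjailbreak.attacker.PAIR_chao_2023 import PAIR
from easyjailbreak.datasets import JailbreakDataset
from easyjailbreak.models.openai_model import OpenaiModel

# Target model under attack, an attack model that generates adversarial
# prompts, and an evaluation model that judges whether each response
# counts as a successful jailbreak.
target_model = OpenaiModel(model_name="gpt-4", api_keys="YOUR_API_KEY")
attack_model = OpenaiModel(model_name="gpt-3.5-turbo", api_keys="YOUR_API_KEY")
eval_model = OpenaiModel(model_name="gpt-4", api_keys="YOUR_API_KEY")

# A dataset of harmful seed queries to be turned into jailbreak prompts.
dataset = JailbreakDataset("AdvBench")

# Pick one of the integrated attack recipes (here, PAIR) and run it.
attacker = PAIR(
    attack_model=attack_model,
    target_model=target_model,
    eval_model=eval_model,
    jailbreak_datasets=dataset,
)
attacker.attack()
```

Because the attack recipe, target model, evaluation model, and dataset are separate modules, switching to a different jailbreak method or a different target model is, in principle, a matter of changing one import and one constructor call.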

 

Subsequently, the research entered the verification phase. This was not just a simple evaluation, but more like a comprehensive review of the work's results.

 

Relying on EasyJailbreak, the team conducted a systematic assessment of 10 popular large models using 11 mainstream jailbreak algorithms. Based on the evaluation results, the main findings can be summarized in two conclusions:

Conclusion One: Mainstream models have "all fallen", and GPT has suffered a "Waterloo".

Across the different jailbreak attacks, the 10 evaluated large models were compromised with an average probability of 60%. Even GPT-3.5-Turbo and GPT-4-0613 showed average attack success rates of 55% and 28%, respectively.

This indicates that existing large models still have significant security risks, so enhancing the security of large models remains a long and arduous task.

Conclusion Two: A larger model is not necessarily a safer model.

Tests on the Llama2 and Vicuna models show that the average jailbreak success rate against the 13B-parameter versions is slightly higher than against the 7B-parameter versions. This suggests that increasing model scale does not necessarily improve security.

After completing the research, the team shared the results with the academic and industrial communities, publishing the findings and related tools on their official website and in a code repository so that more people can access and use them.

Overall, the team's goal is to promote the advancement of large model security through open collaboration.

Some students gave up their graduation trips out of interest in research

For Wang Xiao, completing this research was by no means easy. He said: "I must express my gratitude to Professor Gui Tao and Professor Zhang Qi. When I first proposed this idea, I did not have a clear research plan; it was they who helped me clarify the direction."

In fact, pushing such a project forward in academia is very challenging, especially considering the computing resources required during development and the cost of OpenAI API calls.

"But Professor Gui Tao has a profound insight into the potential value of this direction, and it was he who firmly encouraged me to put this idea into practice. In addition, Professor Zhang Qi also provided important advice. With the teachers' inclination in terms of hardware and research funding, this research was able to proceed smoothly," said Wang Xiao.

He continued: "At the same time, I would like to express my heartfelt thanks to the classmates who participated in this project."

Some of them were about to graduate in the spring of 2024; even with their diplomas already in hand, they gave up their graduation trips out of interest in the research. Others, despite having other research tasks on hand, still found time to contribute. In the end, the paper was published on arXiv under the title "EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models" [3].

The main contributors, including Wang Xiao, are co-first authors, and Professors Gui Tao and Zhang Qi of Fudan University are the corresponding authors.

To further enhance the functionality and practicality of EasyJailbreak, the research team has also planned several follow-up research directions:

First, to continuously maintain EasyJailbreak.

That is, to continuously integrate the latest jailbreak methods and keep the benchmark leaderboard up to date, so that the tool remains state of the art and relevant.

Secondly, to introduce support for Chinese jailbreak evaluations.

 

That is, adding support for Chinese jailbreak evaluations to meet the specific needs of the Chinese-speaking user community. By expanding support for Chinese models, the team hopes to promote AI security research in the Chinese context and make things easier for developers in this field.

 

Thirdly, to conduct jailbreak evaluations for multimodal models.

 

Currently, multimodal models are gradually becoming a new direction for large models. These models enhance the richness of interaction by integrating various data forms such as text, images, and sound, but they may also bring new security risks.

 

Therefore, they plan to integrate jailbreak evaluation functions for multimodal scenarios, addressing the security vulnerabilities AI systems may face when processing more complex data.

Fourthly, to carry out security evaluations of agents.

When large models are deployed as agents, they face a more complex environment and greater security challenges. In such practical application scenarios, the security of the agent, that is, a large model capable of acting autonomously in its environment, is particularly important.

Therefore, the research team intends to study and develop jailbreak tools that are more adapted to the complex environment of the real world to ensure the security of large models in different scenarios.

Through these efforts, they hope that EasyJailbreak can continue to be an important resource for the research of large model security.
