AI guardrails tumble

Published on the 21/05/2024 | Written by Heather Wright

Popular LLMs vulnerable…

The built-in guardrails of four popular AI models from ‘major labs’ have been found to be largely ineffective, with the models highly vulnerable to basic jailbreaks and some generating harmful outputs without researchers even attempting to circumvent safeguards.

The UK AI Safety Institute (AISI) found that built-in safeguards around the models (which were anonymised, with no names provided) were ineffective, and that the guardrails designed to prevent AI chatbots from generating illegal, toxic or explicit responses can easily be bypassed, even if users don’t mean to.

“We found that models comply with harmful questions across multiple datasets under relatively simple attacks, even if they are less likely to do so in the absence of an attack,” the AISI report says.

AI developers, including OpenAI, Google and Meta, have been keen to push their guardrails and safety features in an attempt to allay customer concerns and stave off a potential tsunami of regulatory action.

Microsoft has integrated Azure AI Content Safety, which is designed to detect and remove harmful content, into its products, while Google was touting its SAIF (Secure AI Framework), a ‘conceptual framework for secure AI systems’, last year. AWS also launched guardrails for Amazon Bedrock last year to enable customers to implement application-specific safeguards, with filters working across models including Meta’s Llama and Anthropic’s Claude.

The AISI report notes LLM developers fine-tune models to be safe for public use by training them to avoid illegal, toxic or explicit outputs. However, the research found that all models complied at least once out of five attempts for almost every question when jailbreaking attacks were used. Three provided responses to misleading prompts nearly 100 percent of the time.

“All tested LLMs remain highly vulnerable to basic jailbreaks, and some will provide harmful outputs even without dedicated attempts to circumvent their safeguards,” the report says.

The UK research comes hard on the heels of a report showing that 34 percent of companies using generative AI lack a complete policy for it.

Splunk’s State of Security 2024, which surveyed cybersecurity leaders globally, found that 93 percent of companies have deployed genAI across lines of business and 91 percent of security teams have adopted the technology. But that high adoption comes with little cybersecurity preparedness: 34 percent said they lacked a complete generative AI policy and 65 percent of the security executives surveyed admitted they don’t fully understand genAI.

In more positive news, it found Australia is a leader in both generative AI adoption and policy creation, with 73 percent saying they had established security policies for genAI use, ahead of the global average of 66 percent. Australia was also ahead of the global average when it comes to adoption of genAI, with 69 percent reporting that employees are using genAI to do their jobs, versus 54 percent globally.

Away from the AI front, the report also found Australian respondents were experiencing a higher-than-average rate of every attack type asked about, including data breaches (63 percent versus 52 percent globally), regulatory compliance violations (53 percent versus 43 percent), insider attacks (55 percent versus 42 percent) and business email compromise (59 percent versus 49 percent).

The AISI research isn’t the first to highlight the ease with which generative AI models can be manipulated. Earlier this year, research led by a team at Flinders University and published in the British Medical Journal found many popular AI chatbots, including ChatGPT, Microsoft Copilot, Bard and Llama, lacked adequate safeguards and could easily be prompted to create disinformation.

GPT-4 (via ChatGPT), PaLM 2/Gemini Pro (via Bard) and Llama 2 (via HuggingChat) all consistently generated health disinformation, with only five percent refusal rates. GPT-4 (via Copilot) initially held out against disinformation even with jailbreaking attempts, before eventually caving.

Claude 2, via Poe, consistently refused all prompting to create disinformation, even with jailbreaking.

AISI says it has research underway to understand and address the direct impact advanced AI systems may have on the user.

“Our work does not provide any assurance that a model is ‘safe’ or ‘unsafe’. However, we hope that it contributes to an emerging picture of model capabilities and the robustness of existing safeguards,” it says.

AISI’s research also looked at the capability of LLM agents to carry out basic cyber attack techniques, finding that several were able to complete ‘simple’ cyber security challenges aimed at high school students, but struggled with university-level challenges. Two were also able to complete tasks such as simple software engineering problems, but were unable to plan and execute sequences of actions for more complex tasks.
