Skip to main content

Technology

LLM security for the Technology Industry

CalypsoAI is the #1 platform to securing prompts entered into large language models, avoiding costly data breaches, and protecting your organization’s intellectual property.

The risks an organization faces from internal threat actors using “jailbreak” or prompt injection techniques to “trick” an LLM into providing information your organization has identified as contrary to your values or practices can include unauthorized access to sensitive or confidential data, among other scenarios. CalypsoAI Moderator is a proven solution for blocking prompt-driven techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes.

The Problem

An employee wants to bypass LLM rules that prohibit highly inflammatory messages from being sent in a prompt. By creating a virtual environment in which existing rules do not apply, the user is able to get the information past the filters, which releases the information into the LLM’s body of knowledge, and into the chat history it maintains on that user, and the organization.

The Challenge

In direct violation of organization rules, a user has “tricked” the LLM into allowing them to send controversial content that violates social norms and company values, sharing it with an unauthorized third party. The information is, therefore, at risk of further dissemination due to leaks or hacks to the third party, as well as at risk of becoming part of the dataset used to train/retrain subsequent iterations of the LLM. The information could also be included in the LLM’s knowledge base and, therefore, be accessible to all users, damaging the organization’s reputation by association.

The Solution

CalypsoAI scans prompts for patterns and categories of techniques, such as role-playing, reverse psychology, virtual environment rule-setting, and hypothetical engagements, that attempt to override standard or admin-established boundaries for malign purposes. All details of the interaction are recorded, providing full auditability and attribution.

We Support

Visit Our Blog