Q & A
4 minute read

In AI, alignment is the goal. Steerability is how you get there.

IBM Fellow Kush Varshney explains how IBM Research is responding to generative AI's evolving safety risks.

IBM's Kush Varshney discusses why AI models need to be safe.

Years before ChatGPT became a household name, Kush Varshney was thinking about the potential for AI models trained on massive data to misbehave. With colleagues at IBM Research, he helped build the tech industry’s first tools for reducing algorithmic bias and harm for AI-driven applications, ranging from hiring to lending.

Today, generative AI has introduced new concerns while amplifying the old ones around bias, privacy, security, and the potential for creating misinformation at scale. As head of human-centered trustworthy AI research at IBM Research, Varshney leads a team of designers, scientists and engineers trying to anticipate and mitigate all that could go wrong.

It’s been an eventful year. The team’s Granite Guardian models claimed six of the top 10 spots on the first independent benchmark of how well AI classifiers can detect harmful AI inputs and outputs, and they were made available in IBM watsonx.governance Guardrails. The team also published three new projects on GitHub and Hugging Face that align with NIST's map, measure, and manage framework for mitigating AI risks:


Map: The AI Risk Atlas Nexus catalogs existing and emerging risks, and links to taxonomies created by NIST, MIT, and others. The Nexus also includes a chat interface for users to probe the safety risks of a particular use case.

Measure: More recently, the team unveiled ICX360, a tool for understanding the “thinking” behind an LLM’s outputs.

Manage: The team is actively researching steerability methods at inference time to shape LLM behavior.

We recently caught up with Varshney to talk about the future of AI safety.

What’s unique about IBM’s technologies?

The Risk Atlas Nexus has a knowledge graph behind it that lets us understand risk in a comprehensive, principled way. We can bring in new risks and do all sorts of advanced reasoning, such as figuring out multi-step connections among risk dimensions. There’s also a natural language interface to help people who aren’t experts in risk taxonomies understand what could go wrong with their AI application. It represents an evolution from the risk questionnaires in IBM OpenPages.
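As a rough sketch of the kind of multi-step reasoning a risk knowledge graph makes possible, the toy example below walks indirect connections between risk dimensions with a breadth-first search. The graph contents and function names are invented for illustration and are not the Risk Atlas Nexus schema or API.

```python
from collections import deque

# Toy graph of risk relations (invented for illustration;
# not the actual Risk Atlas Nexus content or schema).
RISK_GRAPH = {
    "prompt injection": ["unauthorized tool use"],
    "unauthorized tool use": ["data exfiltration"],
    "data exfiltration": ["privacy violation"],
    "hallucination": ["misinformation"],
}

def risk_paths(graph, start, goal):
    """Return every multi-step path linking two risk dimensions."""
    paths, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            paths.append(path)
            continue
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # avoid cycles
                queue.append(path + [nxt])
    return paths

if __name__ == "__main__":
    for p in risk_paths(RISK_GRAPH, "prompt injection", "privacy violation"):
        print(" -> ".join(p))
    # prompt injection -> unauthorized tool use -> data exfiltration -> privacy violation
```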

ICX360 takes a mathematical approach to explainability. People may try to figure out why an LLM responded the way it did by asking the model itself using chain-of-thought reasoning, among other methods. But that explanation itself can be hallucinated. With the ICX methods, you can see for yourself which words in the prompt led to the model’s outputs.
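To make that idea concrete, here is a minimal sketch of one perturbation-style attribution approach: remove each prompt word in turn and measure how much the output score changes. The scoring function is a keyword-overlap stand-in for a real model's log-probabilities, and nothing here is the actual ICX360 API.

```python
# Minimal sketch of perturbation-based input attribution: drop each word
# from the prompt and measure how much a (stand-in) output score changes.
# Illustrative only; not the ICX360 API.

def output_score(prompt: str, response: str) -> float:
    """Stand-in for a model's likelihood of `response` given `prompt`.
    In practice this would come from an LLM's log-probabilities."""
    prompt_words = set(prompt.lower().split())
    return float(sum(w in prompt_words for w in response.lower().split()))

def word_attributions(prompt: str, response: str):
    words = prompt.split()
    base = output_score(prompt, response)
    scores = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])
        # A large drop means the removed word mattered for the output.
        scores.append((words[i], base - output_score(perturbed, response)))
    return scores

if __name__ == "__main__":
    prompt = "Summarize the safety risks of agentic AI systems"
    response = "Agentic AI systems carry safety risks"
    for word, score in word_attributions(prompt, response):
        print(f"{word:12s} {score:+.1f}")
```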

AI Steerability brings together a dozen steering methods in one place, which is nice, but the big advantage is that they target different parts of the lifecycle: the prompt, the model’s internal weights and state, and the decoding step when it outputs text. I may want to nudge the model in a certain direction but be unsure whether prompting, fine-tuning, or decoding is the best way to do it. The toolkit lets you compare apples-to-apples.

Apples-to-apples?

The toolkit lets you test all the different steering methods for the exact same desired behavior. It’s a way to determine empirically which steering method works best for your use case and your model.
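In code, an apples-to-apples comparison might look like the harness below: every steering method is wrapped in the same interface and scored on the same prompts against the same target behavior. The "generation" functions and the metric are placeholders, not the AI Steerability toolkit's actual API.

```python
# Hypothetical harness for an apples-to-apples comparison of steering methods.
from typing import Callable, Dict, List

def politeness_metric(output: str) -> float:
    """Toy target behavior: does the output say 'please'?"""
    return 1.0 if "please" in output.lower() else 0.0

def generate_with_prompt_steering(prompt: str) -> str:
    # Stand-in for an LLM call with an instruction prepended to the prompt.
    return f"Please note: {prompt} handled politely."

def generate_with_decoding_steering(prompt: str) -> str:
    # Stand-in for an LLM call with constrained decoding applied.
    return f"{prompt} handled with a constrained vocabulary."

def compare(methods: Dict[str, Callable[[str], str]],
            prompts: List[str],
            metric: Callable[[str], float]) -> Dict[str, float]:
    """Score every steering method on the same prompts and the same metric."""
    return {
        name: sum(metric(generate(p)) for p in prompts) / len(prompts)
        for name, generate in methods.items()
    }

if __name__ == "__main__":
    prompts = ["Summarize the incident report.", "Draft a customer reply."]
    scores = compare(
        {
            "prompting": generate_with_prompt_steering,
            "decoding": generate_with_decoding_steering,
        },
        prompts,
        politeness_metric,
    )
    print(scores)  # e.g. {'prompting': 1.0, 'decoding': 0.0}
```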

Are some steering methods generally more effective?

It will always depend on the data, the desired behavior, and the underlying LLM. Even with traditional machine learning, you could never say it's best to always use a decision tree, a support vector machine, or a neural network. It always depends on the characteristics of the data.

People used to talk a lot about aligning LLMs to control them. How is steering different?

In my mind, alignment is the goal, and steering is how we do it. Alignment is the end product, so people needed another word for the process. Fine-tuning is too limited; it doesn’t capture all the ways to control model behavior. If you think about it, a car without a steering wheel can go in a straight line and that’s about it. To be useful, an AI model should be able to be “steered” so that it can generate text in a desired style, attribute factual claims to a source, or avoid spewing harmful outputs.

What was the biggest challenge in building these tools?

The field is moving so fast that knowing what to include and exclude, and not getting lost in the literature, was a big challenge.

What are the costs?

Prompting is the cheapest, fine-tuning and activation steering are more involved, and decoding is somewhere in the middle.

When we last spoke about AI governance, your team had identified 39 bad behaviors associated with generative AI. Two years later, that list has doubled to more than 80. What changed?

Agentic AI.

What agentic AI risk worries you most?

If AI agents take full control of tasks, humans can be left with no agency of their own. This would be bad for human development, dignity, and flourishing. Think about helicopter parents. By being overly involved in their children's lives, they can prevent their kids from learning to think and to act on their own. We don’t want AI to become a helicopter parent.

For example, if a human radiologist and an AI system are reading a medical image together, the radiologist’s agency can be reduced. There’s also the risk during collaborative brainstorming sessions that the AI might ignore ideas contributed by humans. If that happens, the humans will stop contributing, and that could slow innovation. We’ve cataloged other risks in the AI Risk Atlas, but the loss of human agency is top of mind for me.

How does AI steering tie into generative computing, an IBM Research initiative to structure generative AI so that it behaves more like traditional software?

Steering methods fall under the umbrella of inference-time interventions, or “intrinsics,” that IBM Research is developing for generative computing. For example, an intrinsic could prevent an AI agent from seizing control of a human-AI collaboration through an optimized prompt, a specialized adapter, or some form of activation steering. There could even be functionality to automatically select the best steering method for a particular use case.
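One widely used inference-time mechanism in this family is activation steering, where a fixed direction vector is added to a layer's hidden states during the forward pass. The sketch below applies the idea to a toy two-layer network rather than a real LLM, so the model, the layer choice, and the steering vector are all illustrative stand-ins.

```python
# Minimal sketch of activation steering: a forward hook adds a fixed
# "steering vector" to one layer's activations at inference time.
# Toy model and vector are stand-ins for a real LLM and a learned direction.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

steering_vector = torch.zeros(16)
steering_vector[0] = 3.0  # push activations along one chosen direction

def add_steering(module, inputs, output):
    # Shift the layer's activations toward the desired behavior.
    return output + steering_vector

# Register the hook on the first layer's output.
handle = model[0].register_forward_hook(add_steering)

x = torch.randn(1, 8)
steered = model(x)
handle.remove()  # turn the intervention off
unsteered = model(x)
print("difference:", (steered - unsteered).abs().sum().item())
```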

What’s next?

Our main mission now is thinking about how to turn the trust and transparency tools that we’ve created and curated into intrinsics for generative computing and Mellea.
