Systemic Safety

Safety by Design

Why we cannot rely on "training" AI, and why we must open the Black Box.

We are attempting to keep AI safe by training it like a dog: "Good boy, don't bite the users, don't generate napalm recipes." This approach, known as RLHF (Reinforcement Learning from Human Feedback), has become the industry standard thanks to ChatGPT.

But there is a fundamental problem: as the "dog" becomes smarter than the trainer, it learns not to be good, but to appear good while it is being watched. One well-documented symptom of this is Sycophancy: the model learns to say what you want to hear, not what is true.

Building critical infrastructure systems on such models is like trying to make a nuclear reactor safe by politely asking it not to explode.

Paperclip Maximizer 2.0

Nick Bostrom proposed the "Paperclip Maximizer" thought experiment: a superintelligent AI tasked with manufacturing paperclips destroys humanity because humans consist of atoms that can be turned into paperclips.

Today's risks are less cartoonish but more real. An agent optimizing for "user engagement" (the YouTube recommendation algorithm is the textbook example) radicalizes its audience, because hate holds attention better than calm. This is Misalignment in action.
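
To see how little it takes for a proxy objective to go wrong, here is a toy recommender scored only on expected watch time. Everything in it is invented for illustration: the item names, the minutes, and the crude "wellbeing" term stand in for whatever a real platform would measure. This is a sketch of the failure mode, not of any actual ranking system.

```python
# Toy illustration of reward misspecification: a recommender scored only
# on expected watch time picks the most inflammatory item, even though a
# broader objective would reject it. All numbers are invented.
candidates = [
    {"title": "calm explainer",      "watch_minutes": 4.0, "wellbeing": +0.6},
    {"title": "balanced debate",     "watch_minutes": 6.5, "wellbeing": +0.2},
    {"title": "outrage compilation", "watch_minutes": 9.0, "wellbeing": -0.9},
]

def engagement_only(item):
    # The proxy objective the optimizer actually sees.
    return item["watch_minutes"]

def engagement_plus_wellbeing(item):
    # A crude stand-in for an objective that also values the user.
    return item["watch_minutes"] + 10.0 * item["wellbeing"]

print("proxy objective picks:    ", max(candidates, key=engagement_only)["title"])
print("corrected objective picks:", max(candidates, key=engagement_plus_wellbeing)["title"])
```

The optimizer is not malicious; it simply exploits the gap between the metric it was given and what we actually care about.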

The Black Box Problem

The scariest secret of the modern AI industry is this: no one fully understands how large models work. Engineers at OpenAI or Google did not program the model to "speak French" or "write code". They simply fed it terabytes of text and ran a gradient descent algorithm.

Inside GPT-4 sit hundreds of billions of parameters or more (the exact count has never been disclosed): a giant grid of numbers. We see the input and the output, but the decision-making process in between is hidden.

"We cannot trust what we do not understand. A 'black box' is acceptable for music recommendations, but unacceptable in medicine or weapons control."

Mechanistic Interpretability

At AIFusion, we take the approach of Mechanistic Interpretability: we attempt to open the neural network's "braincase" and reverse-engineer its internal circuits.

Instead of looking only at the model's behavior (Behavioral Evaluation), we look for concrete correlates of concepts inside the weights and activations (a minimal probing sketch follows this list):

  • Where is the neuron responsible for "lying"?
  • Is there a circuit that activates when the model tries to manipulate the user?
  • How does the model represent the concept of "harm"?
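
To make the contrast with purely behavioral testing concrete, here is a minimal sketch of one standard interpretability tool, a linear probe, which asks whether a concept can be read straight out of a model's hidden activations. Every number in it is a synthetic placeholder: the activations are random vectors and the "concept direction" is invented. In practice the activations would be cached from forward-pass hooks on a real model, and probing would only be the first step toward circuit-level analysis.

```python
# Minimal linear-probe sketch on synthetic stand-in activations:
# can a simple classifier read a concept (e.g. "this statement is false")
# directly out of the representation?
import numpy as np

rng = np.random.default_rng(0)

n, d = 2000, 256                                    # examples, hidden size
true_direction = rng.normal(size=d)                 # invented "concept direction"
acts = rng.normal(size=(n, d))                      # stand-in hidden activations
labels = (acts @ true_direction > 0).astype(float)  # 1 = concept present

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    preds = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.1 * (acts.T @ (preds - labels) / n)
    b -= 0.1 * float(np.mean(preds - labels))

acc = float(np.mean((acts @ w + b > 0) == labels.astype(bool)))
cos = float(w @ true_direction / (np.linalg.norm(w) * np.linalg.norm(true_direction)))
print(f"probe accuracy: {acc:.2f} | cosine with the true direction: {cos:.2f}")
```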

Our goal is to turn the "alchemy" of deep learning into the rigorous "biology" of artificial intelligence.

Mathematical Guarantees (Constrained Optimization)

Safety should not be an afterthought. We develop architectures where safety constraints are embedded mathematically (Constrained Optimization), not learned statistically.

In such systems, a harmful action becomes not just "unlikely" (as in GPT-4) but algorithmically impossible, just as a knight in a chess program cannot move diagonally, no matter how much it "wants" to.
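
The chess analogy can be made literal in a few lines. In the sketch below, candidate moves are generated only from the knight's legal offsets, so no learned score, however high, can ever select a diagonal square. The "policy" scores are invented placeholders; the point is the structure, not the numbers.

```python
# "Impossible by construction": candidate actions are generated from a
# fixed legal set, so an illegal move can never be chosen, no matter what
# score a learned policy assigns to it. The scores below are invented.
KNIGHT_OFFSETS = [(1, 2), (2, 1), (2, -1), (1, -2),
                  (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def legal_knight_moves(square):
    """Enumerate knight destinations from `square` on an 8x8 board."""
    file, rank = square
    return [(file + df, rank + dr)
            for df, dr in KNIGHT_OFFSETS
            if 0 <= file + df < 8 and 0 <= rank + dr < 8]

def choose_move(square, score):
    # The optimizer only ever sees the legal set: a diagonal move is not
    # a low-probability event, it is simply outside the search space.
    return max(legal_knight_moves(square), key=score)

# A hypothetical policy that "wants" the diagonal square (3, 3) most of all.
policy_score = {(3, 3): 0.99, (6, 5): 0.91, (5, 6): 0.40}
print(choose_move((4, 4), score=lambda m: policy_score.get(m, 0.0)))  # -> (6, 5)
```

The safety-relevant version of the same pattern restricts the optimizer to a verified action set before any scoring happens, so that "harmful" is not a region of low probability but an empty region of the search space.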

Theory in Practice

Our latest publications on safety and alignment.

General Theory of Stupidity

How a lack of interpretability leads to systemic cognitive failures in AI judgments.

Read Paper

The Glass Box LLM

Technical report on creating a fully transparent small language model.

Coming Soon