How Heretic Works
Heretic implements directional ablation, also known as "abliteration." In each transformer layer of a language model, it locates the matrices of the supported components (primarily the attention out-projection and the MLP down-projection) and orthogonalizes them with respect to a computed "refusal direction." This inhibits the expression of that direction in the matrices' outputs, and with it the censorship-related tendencies in the model's output.
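The orthogonalization step can be illustrated on a toy matrix. The following is a minimal NumPy sketch, not Heretic's actual code: the function name `orthogonalize`, the `scale` parameter, and the (d_model, d_in) weight layout are assumptions made for illustration. It subtracts from the matrix the part of its output that lies along a unit direction r, i.e. W' = W - scale * r r^T W.

```python
import numpy as np

def orthogonalize(weight: np.ndarray, direction: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Remove the component along `direction` from the outputs of `weight`.

    weight:    (d_model, d_in) matrix whose output is y = weight @ x.
    direction: (d_model,) vector; it is normalized to unit length here.
    scale:     hypothetical knob; 1.0 removes the component entirely,
               smaller values remove only part of it.
    """
    r = direction / np.linalg.norm(direction)
    # r @ weight has shape (d_in,); the outer product is r r^T W.
    return weight - scale * np.outer(r, r @ weight)

# Toy data standing in for a real projection matrix and direction.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
r = rng.normal(size=8)

W_abl = orthogonalize(W, r)
x = rng.normal(size=4)
# Component of the ablated output along the (normalized) direction.
projection = (r / np.linalg.norm(r)) @ (W_abl @ x)
```

With `scale=1.0`, `projection` is zero up to floating-point error for any input `x`, which is what it means for the matrix to be orthogonal to the direction.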
The refusal direction is computed for each layer as the difference of means between the first-token residuals produced by "harmful" and "harmless" example prompts. Heretic's key innovation is a TPE-based parameter optimizer (powered by Optuna) that searches for good ablation parameters automatically. These parameters include a flexible ablation weight kernel and the ability to linearly interpolate between the nearest refusal direction vectors, which opens up a far larger space of candidate directions than the per-layer ones alone. Ablation parameters are also applied separately to different components, such as the MLP and attention interventions, to minimize damage to the model's overall intelligence.
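The two direction-related ideas above, difference of means and interpolation between neighboring layers, can be sketched on synthetic arrays. This is an illustrative NumPy sketch under stated assumptions, not Heretic's implementation: the function names, the (n_prompts, n_layers, d_model) array layout, the fractional `index` argument, and the renormalization of each vector to unit length are all hypothetical choices.

```python
import numpy as np

def refusal_directions(harmful: np.ndarray, harmless: np.ndarray) -> np.ndarray:
    """Per-layer difference of means, normalized to unit length.

    harmful, harmless: (n_prompts, n_layers, d_model) first-token residuals.
    Returns an (n_layers, d_model) array of unit vectors.
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff, axis=-1, keepdims=True)

def interpolate_direction(directions: np.ndarray, index: float) -> np.ndarray:
    """Blend the two per-layer directions nearest to a fractional layer index.

    index must lie in [0, n_layers - 1]. Renormalizing the blend is an
    assumption of this sketch.
    """
    lo = int(np.floor(index))
    hi = min(lo + 1, len(directions) - 1)
    t = index - lo
    mixed = (1.0 - t) * directions[lo] + t * directions[hi]
    return mixed / np.linalg.norm(mixed)

# Synthetic residuals: 4 layers, hidden size 8, different prompt counts.
rng = np.random.default_rng(0)
harmful = rng.normal(loc=0.5, size=(16, 4, 8))
harmless = rng.normal(loc=0.0, size=(10, 4, 8))

dirs = refusal_directions(harmful, harmless)
mid = interpolate_direction(dirs, 1.5)  # halfway between layers 1 and 2
```

A fractional index lets an optimizer treat the choice of direction as a continuous parameter rather than a pick from a fixed list of layers, which is one way to read the larger search space described above.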
Why Use It
Heretic offers an automated solution for developers, researchers, and users who want to deploy or experiment with less-restricted language models. Its primary benefit is the fully automatic removal of "safety alignment" without time-consuming and costly post-training, which lowers the barrier to creating specialized, uncensored LLMs. It also preserves more of the original model's intelligence: models processed by Heretic show lower KL divergence from the original model than manually abliterated models do, while achieving a comparable reduction in refusal rates.
Beyond its core decensoring function, Heretic provides research features for studying model internals. Users can generate plots of residual vectors and analyze detailed residual geometry metrics, which supports interpretability studies. This makes Heretic useful both for practical application and for academic exploration of transformer models, and it allows custom uncensored models to be created without extensive manual effort or deep expertise.