
Heretic

Heretic automatically removes censorship ("safety alignment") from transformer-based language models, retaining core intelligence without expensive post-training.


Heretic is a tool that removes built-in censorship, or "safety alignment," from transformer-based language models. It works automatically, combining an implementation of directional ablation, often referred to as "abliteration," with a TPE-based parameter optimizer. This approach lets users decensor a wide range of LLMs, producing models with significantly lower refusal rates while preserving the original model's intelligence and capabilities.

The tool democratizes the creation of uncensored language models, making the process accessible even to users without deep knowledge of transformer internals. By jointly minimizing the number of refusals and the KL divergence from the original model, Heretic produces high-quality decensored versions. Users report that the resulting models give detailed, uncensored responses on sensitive topics and retain more of the original intelligence than other abliterated alternatives.

How Heretic Works

Heretic is built on a technique called directional ablation, or "abliteration." It targets specific matrices in each transformer layer of a language model, primarily the attention out-projection and the MLP down-projection. Heretic orthogonalizes these matrices with respect to a computed "refusal direction," which suppresses the expression of censorship-related behavior in the model's output.
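The orthogonalization step can be sketched in a few lines of NumPy. This is a minimal illustration, not Heretic's actual code: the function name, the matrix orientation, and the `weight` parameter are assumptions chosen for clarity.

```python
import numpy as np

def ablate_direction(W, r, weight=1.0):
    """Project the refusal direction r out of the output space of W.

    W: a weight matrix that writes into the residual stream (e.g. an
       attention out-projection), shape (d_model, d_in), applied as
       out = W @ h.
    r: refusal direction in the residual stream, shape (d_model,).
    weight: 1.0 removes the component entirely; a tunable weight
       allows partial ablation.
    """
    r = r / np.linalg.norm(r)              # unit refusal direction
    return W - weight * np.outer(r, r) @ W # subtract the r-component
```

After ablation with `weight=1.0`, the matrix can no longer produce any output along `r`, which is what inhibits the refusal behavior while leaving the rest of the output space untouched.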

The refusal direction for each layer is computed as the difference of means between the first-token residuals produced by "harmful" and "harmless" example prompts. A key innovation of Heretic is its TPE-based parameter optimizer (powered by Optuna), which automatically searches for good ablation parameters. These include a flexible ablation-weight kernel and optional linear interpolation between neighboring layers' refusal directions, which opens up a much larger space of candidate decensoring directions than per-layer vectors alone. Heretic also tunes ablation parameters separately for different components, such as the MLP and attention interventions, to minimize damage to the model's overall intelligence.
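The two ideas above, difference-of-means directions and interpolation between adjacent layers, can be sketched as follows. This is an illustrative reconstruction under stated assumptions (residuals as plain arrays, one direction per layer), not Heretic's implementation.

```python
import numpy as np

def refusal_direction(harmful, harmless):
    """Difference-of-means direction from first-token residuals.

    harmful, harmless: arrays of shape (n_prompts, d_model) holding
    the first-token residual vector for each example prompt.
    """
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def interpolated_direction(layer_dirs, index):
    """Blend the refusal directions of two adjacent layers.

    A fractional index (e.g. 12.3) mixes the directions of layers 12
    and 13, letting an optimizer explore directions 'between' layers
    instead of being restricted to per-layer vectors.
    """
    lo = int(np.floor(index))
    frac = index - lo
    d = (1.0 - frac) * layer_dirs[lo] + frac * layer_dirs[lo + 1]
    return d / np.linalg.norm(d)
```

An optimizer can then treat the fractional layer index and the ablation weights as continuous parameters to search over.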

Why Use It

Heretic offers an automated solution for developers, researchers, and users who want to deploy or experiment with less-restricted language models. Its primary benefit is fully automatic removal of "safety alignment" without time-consuming and costly post-training, which significantly lowers the barrier to creating specialized, uncensored LLMs. The tool also retains more of the original model's intelligence than manual abliteration, as evidenced by lower KL divergence scores at comparable reductions in refusal rate.
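The two quantities being co-minimized can be made concrete with a toy scoring function. Heretic's actual optimizer is Optuna's TPE sampler and its real objective is not documented here; the `score` function below is a hypothetical stand-in that only shows the shape of such a combined objective.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions,
    e.g. the original model's and the ablated model's softmax outputs
    on the same prompt. Zero iff the distributions are identical."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def score(refusals, kl, max_refusals=100, kl_scale=1.0):
    # Hypothetical combined objective: lower is better. Trials that
    # both refuse rarely AND stay close to the original model win.
    return refusals / max_refusals + kl_scale * kl
```

A TPE optimizer repeatedly proposes ablation parameters, evaluates this kind of score on a held-out prompt set, and concentrates its search where the score is lowest.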

Beyond its core decensoring function, Heretic provides valuable research features for understanding model internals. Users can generate plots of residual vectors and analyze detailed residual geometry metrics, aiding in interpretability studies. This dual functionality makes Heretic an indispensable tool for both practical application and academic exploration of transformer models, enabling the creation of custom, high-quality, uncensored AI without extensive manual effort or deep expertise.

Features

  • Fully automatic censorship removal
  • Optimizes ablation parameters using TPE (Optuna)
  • Co-minimizes refusals and KL divergence
  • Supports most dense, multimodal, and MoE models
  • Advanced research features for model interpretability

Use Cases

  • Decensoring existing language models
  • Creating uncensored LLMs for specific applications
  • Reducing refusal rates in AI assistants
  • Research into LLM safety alignment and semantics
  • Customizing LLMs for niche content generation

Last verified: February 17, 2026