Explanations

AI assistant

This neuron spots the assistant’s self-descriptive policy-and-safety statements, especially “I cannot/absolutely cannot” refusals based on its programming constraints.

Configuration

Prompts (Dashboard): 238,145 prompts, 512 tokens each

Dataset (Dashboard): lmsys + oasst1

Negative Logits

" würde"     0.38
" või"       0.37
"或"         0.36
" algum"     0.36
"would"      0.35
"もしくは"   0.35
" poderia"   0.34
" ataupun"   0.34
"就可以"     0.33
" would"     0.33

Positive Logits

"あくまで"        0.54
"භ"               0.36
" narzęd"         0.35
" teknoloj"       0.33
" computadora"    0.33
"TECHN"           0.33
" fundamentally"  0.33
" proizvoda"      0.33
"기술"            0.32
" inanimate"      0.32

Activations Density 0.579%