Explanations
AI assistant
This neuron fires on the assistant’s self-descriptive policy and safety statements, especially “I cannot” or “I absolutely cannot” refusals that cite its programming constraints.
Negative Logits
würde       0.38
või         0.37
或          0.36
algum       0.36
would       0.35
もしくは    0.35
poderia     0.34
ataupun     0.34
就可以      0.33
would       0.33
Positive Logits
あくまで      0.54
භ             0.36
narzęd        0.35
teknoloj      0.33
computadora   0.33
TECHN         0.33
fundamentally 0.33
proizvoda     0.33
기술          0.32
inanimate     0.32
Activations Density 0.579%