Explanations
violation
The neuron principally responds to mentions of “violation” (i.e. references to breaches of rules, contracts, or standards).
Negative Logits
competence    -0.08
模型          -0.07
العظ          -0.06
Alman         -0.06
hands         -0.06
(search       -0.06
James         -0.06
Consortium    -0.06
granny        -0.06
nắm           -0.06
Positive Logits
violate       0.11
viol          0.11
violation     0.10
violated      0.10
violations    0.09
violating     0.09
Viol          0.09
违            0.08
wipe          0.08
violates      0.08
Activations Density 0.011%