Explanations
deception
The neuron fires on words that describe tricks, traps, or schemes (e.g. “setup,” “trap,” “swindling,” “blackmail”).
Negative Logits
Lincoln        -0.07
entertaining   -0.06
_OWNER         -0.06
Duty           -0.06
July           -0.06
دم             -0.06
_hat           -0.06
Coral          -0.06
Shelley        -0.06
िब             -0.06
Positive Logits
norske         0.08
ικές           0.07
рогра          0.06
ість           0.06
�              0.06
ussed          0.06
Operator       0.06
siding         0.06
определить     0.06
ejména         0.06
Activations Density 0.075%
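
The density figure can be read as the share of sampled token positions on which the neuron is active. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration, not details taken from this page or from the tool that produced it.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions with a
    strictly positive activation. The page does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 3 active positions out of 4000 tokens.
sample = [0.0] * 3997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.075% corresponds to roughly 3 active positions in every 4000 tokens, which suggests a sparse neuron.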