Explanations
The neuron detects words expressing admission or confession (e.g., “admitted,” “admits”).
Negative Logits
flood       -0.07
followed    -0.06
Fish        -0.06
shape       -0.06
_THREAD     -0.06
peel        -0.06
publish     -0.06
(seconds    -0.06
snake       -0.06
curtain     -0.06
Positive Logits
admitted    0.10
admits      0.09
admitting   0.09
admit       0.09
admittedly  0.08
confessed   0.07
confess     0.07
aos         0.07
ategorie    0.07
ório        0.06
Activations Density: 0.009%