INDEX
Explanations
self-harm
This neuron primarily detects mentions of self-harm and related behaviors.
Negative Logits
Token         Logit
Radio         -0.07
Oscars        -0.06
Diego         -0.06
Nodes         -0.06
modity        -0.06
лина          -0.06
(크기          -0.06
Geschichte    -0.06
ニメ          -0.06
یلی           -0.06
Positive Logits
Token             Logit
gentle            0.07
aç                0.07
_BACK             0.07
.den              0.06
”↵↵               0.06
.ADMIN            0.06
way               0.06
.':               0.06
DEF               0.06
/contentassist    0.06
Activations Density 0.004%
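The density figure above is usually read as the share of tokens on which the neuron fires at all. The sketch below shows that calculation under that assumption. The function name `activation_density`, the zero threshold, and the toy data are all hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens with a non-zero
    activation; the dashboard's exact definition is not stated in the source.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data (made up): 1 active token out of 25,000 gives 0.004%.
toy = [0.0] * 24_999 + [1.3]
print(f"{activation_density(toy):.3f}%")
```

Under this reading, a density of 0.004% would mean the neuron is active on roughly 1 in 25,000 tokens, which fits a narrow, topic-specific detector.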