Explanations
health dangers
This neuron detects language describing things as dangerous, harmful, or posing health and safety risks.
Negative Logits
Token        Logit
cường        -0.06
<label       -0.06
sunshine     -0.06
ichael       -0.06
awkward      -0.06
meilleurs    -0.06
ctor         -0.06
gameplay     -0.06
mejores      -0.06
Janet        -0.06
Positive Logits
Token        Logit
––           0.08
Ná           0.07
()):↵        0.07
):-          0.07
sik          0.07
–            0.06
Vari         0.06
_)           0.06
%}↵          0.06
읍           0.06
Activations Density: 0.053%
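
The page does not define how the activation density is computed. A minimal sketch, assuming it means the fraction of sampled tokens on which the neuron's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical and chosen only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the neuron
    fires at all. This definition is not stated in the source page.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 tokens, 5 of which activate the neuron.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.053% would mean the neuron fires on roughly 5 of every 10,000 tokens, which is consistent with a sparse, topic-specific detector.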