Explanations
choices, requirements, conditions, options
This neuron activates on mentions of “safe” (and related safety contexts), flagging statements about things being safe or unsafe.
Negative Logits
chez        -0.07
_Struct     -0.07
athletic    -0.06
Personally  -0.06
-stop       -0.06
']↵↵        -0.06
tumblr      -0.06
runoff      -0.06
omain       -0.06
Arc         -0.06
Positive Logits
ادية        0.07
rå          0.07
wore        0.07
/gui        0.06
sole        0.06
Shoot       0.06
Directive   0.06
index       0.06
(todo       0.06
・ア         0.06
Activations Density: 0.321%