Explanations
This neuron detects occurrences of advisory or warning language, especially the word “important.”
Negative Logits
starts    -0.07
почина    -0.06
Wr        -0.06
!');↵     -0.06
Within    -0.06
政策      -0.06
имо       -0.06
мы        -0.06
ací       -0.06
utils     -0.06
Positive Logits
(bean     0.07
katıl     0.07
.Struct   0.07
          0.07
          0.07
_TRAN     0.07
_TRY      0.06
<i        0.06
,)        0.06
.now      0.06
Activations Density 0.015%