INDEX
Explanations
negations and conditional phrases indicating refusal or limitations
Negative Logits
äl      -0.16
ndx     -0.16
reet    -0.16
zdy     -0.15
iên     -0.15
rar     -0.14
ripp    -0.14
iedade  -0.14
lew     -0.14
ntax    -0.13
Positive Logits
be       0.17
diễn     0.16
iece     0.16
員       0.15
quet     0.14
员       0.14
sul      0.14
rut      0.14
-linear  0.14
ogo      0.14
Activations Density 0.074%