Explanations
instances of refusal or resistance to actions or decisions
Negative Logits
Zwe             -0.17
aley            -0.16
afe             -0.16
å¥ĩ             -0.16
_categorical    -0.14
ikel            -0.14
ijo             -0.14
üt              -0.14
kn              -0.14
inos            -0.14
Positive Logits
refusal         0.19
refuses         0.18
refused         0.17
refuse          0.17
refusing        0.16
insistence      0.15
Marr            0.15
******************************************************************************↵  0.15
amina           0.15
Ĵáŀ             0.15
Activations Density 0.197%