Explanations
phrases indicating refusal or rejection of actions
Negative Logits
xbf         -0.17
alli        -0.15
uppe        -0.15
riba        -0.14
mutable     -0.14
mrt         -0.14
olic        -0.14
uala        -0.14
réuss       -0.13
apur        -0.13
Positive Logits
accept      0.29
accepting   0.28
accept      0.27
accepts     0.26
Accept      0.25
Accept      0.23
_accept     0.22
allow       0.21
acept       0.21
acceptance  0.21
Activations Density: 0.127%