Explanations
phrases indicating willingness or refusal to take actions
Negative Logits
avia           -0.17
МО             -0.16
决定           -0.16
réuss          -0.14
ysi            -0.14
beforeSend     -0.14
_ACL           -0.14
succesfully    -0.14
.ribbon        -0.14
emey           -0.14
Positive Logits
accept         0.25
accepting      0.24
accepts        0.22
let            0.22
Accept         0.21
compromise     0.21
admit          0.20
accept         0.20
cooperation    0.19
cooperate      0.19
Activations Density 0.100%
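The dashboard does not state how the density figure is computed. A common convention, assumed here, is the fraction of sampled token positions on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all; the source does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 1 of 1000 token positions.
sample = [0.0] * 999 + [3.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.100%
```

Under this reading, a density of 0.100% would mean the feature is active on roughly one token in a thousand, which is typical of a sparse feature.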