Explanations
phrases related to refusals and offers of assistance
Negative Logits
Token        Logit
oste         -0.16
succesfully  -0.16
abilidad     -0.15
avia         -0.15
||(          -0.15
omaly        -0.15
erca         -0.15
ãĤ¸ãĤ¢       -0.15
lesbienne    -0.14
lexport      -0.14
Positive Logits
Token        Logit
let          0.30
accept       0.29
accepting    0.28
allow        0.27
letting      0.26
accepts      0.26
Accept       0.25
accept       0.24
allowing     0.23
let          0.22
Activations Density 0.136%
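The density figure above is commonly read as the fraction of token positions on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are assumptions made for illustration, not the dashboard's actual implementation or data.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions with a nonzero
    (above-threshold) activation; the source does not define it.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 136 of them active.
# These numbers are made up to illustrate the scale of a 0.136% density.
sample = [0.0] * 99_864 + [0.5] * 136
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.136% means the feature is active on roughly 1 in 735 token positions, so it is sparse.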