INDEX
Explanations
phrases indicating negation or disbelief
Negative Logits
,               -0.15
unar            -0.15
лаж             -0.15
reon            -0.15
ledge           -0.14
ControllerBase  -0.14
oron            -0.14
lied            -0.13
เลย             -0.13
afort           -0.13
Positive Logits
gusta           0.24
gust            0.23
hub             0.21
ha              0.20
falta           0.19
jos             0.19
han             0.19
import          0.19
pid             0.18
puedo           0.18
Activations Density 0.027%