INDEX
Explanations
negations and expressions of uncertainty or lack of understanding
Negative Logits
Token     Logit
éį        -0.16
ivor      -0.16
unto      -0.14
XH        -0.14
áli       -0.14
phones    -0.14
wares     -0.14
fant      -0.14
ائع       -0.13
facto     -0.13
Positive Logits
Token       Logit
choice      0.29
interest    0.21
choice      0.21
Choice      0.21
Choice      0.19
patience    0.18
Interest    0.18
intention   0.18
doubt       0.18
_interest   0.18
Activations Density 0.045%