INDEX
Explanations
phrases indicating an uncertainty or speculation about future outcomes
the expression of negation or refusal
Negative Logits
illin       -0.69
OTOS        -0.67
bian        -0.65
gypt        -0.62
mens        -0.61
Traps       -0.60
MEN         -0.60
assies      -0.59
angering    -0.59
sqor        -0.58
Positive Logits
't          1.22
itive       0.95
stall       0.78
now         0.74
geon        0.70
rar         0.70
ª           0.69
kish        0.67
ners        0.67
ald         0.65
Activations Density 0.027%