Explanations
indicative phrases and statements that express decision-making or personal opinions
Negative Logits
eton        -0.17
aos         -0.16
auge        -0.15
Cha         -0.15
ζ           -0.15
γκ          -0.14
¶           -0.14
ascimento   -0.14
plode       -0.14
rzy         -0.14
Positive Logits
instead     0.25
Instead     0.20
Instead     0.19
instead     0.18
anter       0.15
вмест       0.14
asco        0.14
的是        0.14
only        0.14
Hack        0.14
Activations Density 0.273%