Explanations
personal pronouns and verbs related to decision-making or choice
Negative Logits
ัตถ      -0.15
almost   -0.14
Herm     -0.14
若       -0.14
arton    -0.13
arges    -0.13
witter   -0.13
Tup      -0.13
elerik   -0.13
рой      -0.13
Positive Logits
ever     0.19
ever     0.18
977      0.16
843      0.16
ças      0.16
י        0.15
997      0.15
EVER     0.15
denn     0.14
577      0.14
Activation Density: 0.126%