INDEX
Explanations
words that express uncertainty or moderation in statements
Negative Logits
Token         Logit
s             -0.19
еÑī           -0.17
sport         -0.16
sing          -0.15
fty           -0.14
axter         -0.14
Ïĥμ           -0.14
isable        -0.14
izational     -0.14
odb           -0.13
Positive Logits
Token         Logit
ewhat         0.14
ebb           0.14
uario         0.13
StandardItem  0.13
CJK           0.13
Ĥ¨            0.13
/stdc         0.13
onth          0.12
º«            0.12
rag           0.12
Activations Density 0.012%