INDEX
Explanations
questions or phrases expressing the extent of emotions or experiences
Negative Logits
nist          -0.15
erta          -0.15
igli          -0.15
剛            -0.14
whose         -0.14
最佳          -0.14
Ukra          -0.14
preferably    -0.14
是否          -0.13
imary         -0.13
Positive Logits
much          0.26
much          0.24
Much          0.21
itzer         0.19
important     0.19
Much          0.19
wrong         0.18
little        0.18
atta          0.17
lucky         0.16
Activations Density 0.044%