Explanations
affirmations or acknowledgments of agreement
Negative Logits
_ASSUME   -0.18
nici      -0.16
strup     -0.15
owler     -0.14
à¹ĥà¸Ķ    -0.14
Ñģом      -0.14
uentes    -0.14
olvers    -0.14
欲        -0.14
uffles    -0.14
Positive Logits
AY        0.34
fine      0.33
ay        0.31
lahoma    0.29
ays       0.29
tober     0.27
fine      0.25
Fine      0.25
then      0.23
so        0.23
Activations Density 0.030%