Explanations
expressions of gratitude or refusal
Negative Logits
Token    Logit
éłĨ      -0.15
hw       -0.15
bjerg    -0.14
soft     -0.14
ulado    -0.14
ysa      -0.14
acro     -0.14
ITT      -0.14
itt      -0.13
bast     -0.13
Positive Logits
Token        Logit
thank        0.29
Thank        0.23
thank        0.23
pref         0.23
Thank        0.22
谢           0.21
preference   0.19
prefer       0.18
THANK        0.17
prefer       0.17
Activations Density 0.178%