INDEX
Explanations
phrases that express preference or contrast
Negative Logits
Token     Logit
urator    -0.17
urious    -0.16
rav       -0.16
ury       -0.16
elly      -0.15
sw        -0.15
$$$$      -0.14
erdale    -0.14
Č         -0.14
uring     -0.14
Positive Logits
Token     Logit
instead   0.19
-than     0.19
than      0.17
-known    0.15
ÙĦ        0.15
erro      0.15
445       0.15
ÙĨÚ¯ÛĮ    0.15
.consume  0.15
apy       0.14
Activations Density 0.018%