Explanations
phrases that express a preference or comparison
Negative Logits
Token     Logit
sci       -0.17
imizer    -0.16
sch       -0.16
erdale    -0.16
uring     -0.15
sb        -0.15
system    -0.15
ald       -0.15
sw        -0.15
san       -0.15
Positive Logits
Token     Logit
ìĦľëĬĶ    0.18
than      0.16
ìĦľ       0.16
icher     0.15
ÙĨÚ¯ÛĮ    0.15
-sex      0.15
much      0.15
-than     0.15
rière     0.14
ODE       0.14
Activations Density: 0.018%