Explanations
negations and contrasts in context
Negative Logits
Lyon  -0.15
plit  -0.15
odos  -0.14
Narrow  -0.14
oney  -0.14
dos  -0.14
quette  -0.14
iyah  -0.14
dit  -0.14
ç¤  -0.14
Positive Logits
vero  0.18
sett  0.17
arend  0.15
æľĹ  0.15
ECTOR  0.14
олÑİ  0.14
IVERS  0.14
ÙĦÙĬÙģ  0.14
æķ  0.14
rve  0.13
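The logit lists above rank vocabulary tokens by how strongly the feature pushes each one up or down. A common way dashboards compute this is to take the dot product of the feature's decoder direction with each token's unembedding column, then sort. The sketch below assumes exactly that (no final layer norm or other scaling), uses plain Python lists in place of model weights, and the function name `top_logit_effects` and the toy numbers are hypothetical illustrations, not this dashboard's actual code or data.

```python
def top_logit_effects(decoder_dir, unembed, vocab, k=10):
    """Rank tokens by the direct logit effect of one feature.

    Assumption: effect(token j) = sum_i decoder_dir[i] * unembed[i][j],
    i.e. the decoder direction projected straight onto the unembedding,
    with no layer norm applied. unembed is a d_model x vocab_size matrix.
    """
    d_model = len(decoder_dir)
    scores = []
    for j, token in enumerate(vocab):
        effect = sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        scores.append((token, effect))
    scores.sort(key=lambda pair: pair[1])   # most negative first
    negative = scores[:k]                   # k lowest effects
    positive = list(reversed(scores[-k:]))  # k highest effects, largest first
    return positive, negative


# Hypothetical toy model: d_model = 2, vocabulary of four tokens.
vocab = ["a", "b", "c", "d"]
decoder_dir = [1.0, -0.5]
unembed = [
    [0.2, -0.1, 0.0, 0.3],
    [0.1, 0.4, -0.2, 0.0],
]
positive, negative = top_logit_effects(decoder_dir, unembed, vocab, k=2)
print([t for t, _ in positive])  # ['d', 'a']
print([t for t, _ in negative])  # ['b', 'c']
```

With a real model the same loop runs over the full vocabulary, and the top and bottom ten entries give lists shaped like the ones above.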
Activation density: 0.048%
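Activation density is usually read as the share of token positions on which the feature is active. The sketch below assumes "active" means an activation strictly greater than zero; the function name `activation_density` and the synthetic activation list are hypothetical, chosen only so the arithmetic reproduces the 0.048% figure.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the activation exceeds threshold.

    Assumption: a feature counts as active when its activation is > threshold
    (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example: 48 active positions out of 100,000 tokens.
acts = [0.0] * 100_000
for i in range(48):
    acts[i * 2000] = 1.0
density = activation_density(acts)
print(f"{density:.3%}")  # 0.048%
```

At this density the feature fires on roughly one token in every 2,083 (1 / 0.00048), which marks it as a sparse feature.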