Explanations
phrases emphasizing exclusivity or contrast
Negative Logits
Token             Logit
anche             -0.20
kur               -0.17
åŁ                -0.16
kart              -0.16
ernes             -0.15
arrant            -0.15
apiro             -0.14
odor              -0.14
ipt               -0.14
ruku              -0.14
Positive Logits
Token             Logit
ABCDEFGHIJKLMNOP  0.15
phe               0.14
gee               0.14
sob               0.14
arius             0.14
ools              0.14
Filtered          0.14
mmc               0.14
Ñī                0.14
/all              0.14
Activations Density: 0.023%