Explanations
phrases indicating belonging or membership within a group
Negative Logits
Token             Logit
oun               -0.16
.scalablytyped    -0.15
931               -0.15
osed              -0.15
obia              -0.14
hoff              -0.14
kır               -0.14
uler              -0.14
zas               -0.14
ünd               -0.14
Positive Logits
Token             Logit
few               0.20
Few               0.16
ema               0.15
maal              0.15
strup             0.15
many              0.15
åĮ                0.14
apo               0.14
_many             0.14
fier              0.14
Activations Density: 0.120%