Explanations
words and names related to dominance and hierarchy
Negative Logits
mong    -0.18
pery    -0.17
afil    -0.15
Kop     -0.15
dup     -0.15
bage    -0.15
monds   -0.15
ilen    -0.14
topics  -0.14
ëļ      -0.14
Positive Logits
estic       0.25
åIJĪãĤıãģĽ   0.15
posit       0.15
á»ı         0.15
uko         0.15
ologne      0.14
ingu        0.14
ople        0.14
etri        0.14
uent        0.14
Activations Density: 0.020%