Explanations
references to specific cultural or social identifiers
Negative Logits
Token     Logit
iya       -0.18
ija       -0.17
patial    -0.17
iena      -0.16
rah       -0.16
å¥ī       -0.16
usters    -0.15
ÙĬÙĬÙĨ    -0.15
Mant      -0.15
iy        -0.15
Positive Logits
Token     Logit
emy       0.31
ÄĻ        0.30
Äħ        0.29
eli       0.26
enn       0.26
elib      0.26
emie      0.24
ÄĻż       0.24
enny      0.23
ÅĦ        0.23
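
Several tokens above (for example ÄĻ, Äħ, ÅĦ) look like raw byte-level BPE vocabulary strings rather than readable text. The source does not name the tokenizer, so the following is only a sketch under the assumption that the strings use the GPT-2-style byte-to-unicode mapping. The names `gpt2_byte_decoder` and `decode_token` are hypothetical helpers introduced for illustration. Tokens that do not fit the mapping are returned unchanged.

```python
def gpt2_byte_decoder():
    """Build the inverse of the GPT-2-style byte-to-unicode table.

    Assumption: the dashboard shows tokens in this notation; the source
    does not confirm which tokenizer produced them.
    """
    # Bytes that the scheme maps to the character with the same code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {chr(b): b for b in keep}
    # Remaining bytes are assigned code points 256, 257, ... in byte order.
    offset = 0
    for b in range(256):
        if b not in keep:
            mapping[chr(256 + offset)] = b
            offset += 1
    return mapping


BYTE_DECODER = gpt2_byte_decoder()


def decode_token(token):
    """Decode one vocabulary string to text, or return it unchanged on failure."""
    try:
        raw = bytes(BYTE_DECODER[ch] for ch in token)
        return raw.decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        # Not representable under the assumed mapping; leave as-is.
        return token


if __name__ == "__main__":
    for tok in ["emy", "ÄĻ", "Äħ", "ÅĦ", "ÄĻż"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under this assumption, ÄĻ, Äħ, and ÅĦ decode to the letters ę, ą, and ń, while plain ASCII tokens such as emy pass through unchanged. The token ÄĻż does not fit the mapping as written, so the helper leaves it untouched rather than guessing.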
Activations Density: 0.015%