INDEX
Explanations
phrases or terms related to different types of characters or personal identities
Negative Logits
ъ
-0.24
ья
-0.24
으로
-0.21
i
-0.20
Ь
-0.20
ью
-0.20
ам
-0.19
ами
-0.19
ом
-0.18
ье
-0.18
Positive Logits
нка
0.31
й
0.29
нд
0.27
◌́ (U+0301, combining acute accent)
0.27
нки
0.27
м
0.26
н
0.26
нг
0.25
ль
0.24
нт
0.23
Activations Density 0.041%