INDEX
Explanations
phrases indicating roles or identities within a community or organization
Negative Logits
川        -0.16
kowski    -0.14
amera     -0.14
inda      -0.14
uda       -0.14
onen      -0.13
own       -0.13
ieu       -0.13
ček       -0.13
.em       -0.13
Positive Logits
unc       0.17
ighth     0.17
ervers    0.16
humans    0.16
-прав     0.15
lue       0.15
pike      0.14
igham     0.14
ufs       0.14
fall      0.13
Activations Density: 0.042%