INDEX
Explanations
words emphasizing equality, community, and the importance of all individuals
Negative Logits
Token       Logit
ãĥĥãĤ·ãĥ¥   -0.19
eum         -0.18
ifr         -0.15
sted        -0.14
dat         -0.14
uja         -0.14
stad        -0.14
stab        -0.14
ÅĻÃŃd       -0.14
ofType      -0.13
Positive Logits
Token       Logit
ÑĸнÑĮ       0.15
rieve       0.15
ibold       0.15
ayi         0.14
REFERRED    0.14
adir        0.14
/testify    0.13
365         0.13
reative     0.13
neh         0.13
Activations Density: 0.708%