INDEX
Explanations
references to specific groups or categories of individuals
Negative Logits
entire    -0.18
etta      -0.15
Když      -0.15
整个      -0.15
.rdf      -0.15
whole     -0.14
шин       -0.14
告        -0.14
cestor    -0.14
[++       -0.13
Positive Logits
only      0.22
Only      0.18
only      0.17
ones      0.17
NONE      0.16
只有      0.16
NONE      0.15
none      0.15
ONLY      0.15
Only      0.14
Activations Density 0.050%
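
The density figure above can be read as the fraction of token positions on which the feature fires. The page does not say how it was computed, so the following is only a minimal sketch under that assumption: the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature
    is active; the dashboard's exact definition is not stated.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 10,000 token positions, 5 of them active.
toy = [0.0] * 9995 + [1.3, 0.7, 2.1, 0.4, 0.9]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.050%
```

With these toy numbers, 5 active positions out of 10,000 give the same 0.050% shown above, meaning the feature would fire on roughly one token in two thousand.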