INDEX
Explanations
personal pronouns in the text
Negative Logits
lete          -0.07
èģĺ           -0.07
{}_           -0.06
ovit          -0.06
اÙĤÙĦ         -0.06
alsex         -0.06
lite          -0.06
.utilities    -0.06
LOUD          -0.06
irts          -0.06
Positive Logits
857           0.07
opp           0.07
eso           0.07
rganization   0.07
pert          0.06
neutr         0.06
cert          0.06
iver          0.06
uator         0.06
ough          0.06
Activations Density 0.036%