Explanations
references to societal and systemic structures or effects
Negative Logits
Token      Logit
ilot       -0.16
oma        -0.16
iano       -0.15
kop        -0.14
fa         -0.14
usion      -0.14
xis        -0.14
tes        -0.14
internal   -0.14
already    -0.14
Positive Logits
Token      Logit
lors       0.17
ãģĬ        0.16
ÙĩÙĨگاÙħ   0.15
lessly     0.15
ÄįnÄĽ      0.14
148        0.14
/Math      0.14
during     0.14
rug        0.13
czas       0.13
Activations Density 0.003%