INDEX
Explanations
references to categories of societal structures and influential figures
Negative Logits
serter       -0.15
unused       -0.15
ーボ         -0.15
ohen         -0.15
ịp           -0.15
RuleContext  -0.14
óa           -0.14
бот          -0.14
احل          -0.14
olib         -0.14
Positive Logits
as           0.35
как          0.20
quanto       0.19
なら         0.18
als          0.17
than         0.17
sebagai      0.17
ong          0.16
että         0.16
كما          0.14
Activations Density 0.055%