INDEX
Explanations
statements and actions related to political accountability and critique
Negative Logits
stol      -0.18
δά        -0.17
celed     -0.16
призна    -0.14
_TD       -0.14
فوت       -0.14
annes     -0.14
potvr     -0.14
ZR        -0.14
_FT       -0.14
Positive Logits
critique   0.35
criticism  0.31
reb        0.30
critiques  0.30
repro      0.30
Crit       0.29
critic     0.28
criticize  0.28
crit       0.27
crit       0.27
Activations Density 0.459%