INDEX
Explanations
themes related to oppression and societal control
Negative Logits
Token     Logit
jist      -0.15
UNUSED    -0.15
ügen      -0.15
eldo      -0.14
ilig      -0.14
odu       -0.13
kest      -0.13
жд        -0.13
ingo      -0.13
num       -0.13
Positive Logits
Token      Logit
perceived  0.28
daring     0.28
dared      0.28
敢         0.26
slightest  0.26
dissent    0.25
critical   0.24
upp        0.23
outspoken  0.23
critical   0.22
Activations Density 0.350%