Explanations
elements related to societal control and censorship
Negative Logits
Token       Logit
orsch       -0.15
jed         -0.14
chung       -0.13
ereg        -0.13
Patch       -0.13
.gz         -0.13
жд          -0.13
ещ          -0.13
важа        -0.13
ότητα       -0.12
Positive Logits
Token         Logit
dared         0.33
daring        0.32
dissent       0.32
dare          0.32
upp           0.29
disple        0.28
inconvenient  0.27
challenge     0.27
敢            0.26
disagree      0.25
Activations Density: 0.201%