Explanations
references to censorship and the implications of free speech
Negative Logits
stim          -0.15
CRM           -0.15
abandonment   -0.14
ÑĢÑİ          -0.14
859           -0.14
778           -0.14
resizing      -0.14
Disposition   -0.13
abyrin        -0.13
uyết          -0.13
Positive Logits
censor        0.52
censorship    0.51
c             0.35
ensor         0.34
ensored       0.33
ÑĨ            0.28
Âłc           0.27
blocked       0.27
bans          0.26
ban           0.25
Activations Density: 0.169%