INDEX
Explanations
references to censorship and banned works
Negative Logits
889            -0.15
loff           -0.15
abyrin         -0.15
pev            -0.15
íħ             -0.15
@student       -0.14
igne           -0.14
Baghd          -0.14
uvw            -0.14
abandonment    -0.14
Positive Logits
censorship     0.44
censor         0.43
c              0.34
ensor          0.33
ensored        0.31
cen            0.26
ensors         0.25
banning        0.23
bans           0.23
ban            0.23
Activations Density: 0.075%