INDEX
Explanations
words related to censorship and control of language
Negative Logits
swick     -0.79
amac      -0.77
verty     -0.75
ilater    -0.75
ptoms     -0.73
docker    -0.70
ancial    -0.69
itness    -0.68
ndra      -0.67
erald     -0.66
Positive Logits
cens          1.01
censorship    0.89
censor        0.85
censored      0.81
zers          0.74
cutter        0.71
levied        0.70
viol          0.69
promulg       0.68
cens          0.68
Activations Density 0.027%
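
The page does not say how the density figure is derived. A minimal sketch, assuming it is the fraction of token positions on which the feature's activation is above zero: the function name `activation_density`, the `threshold` parameter, and the sample data below are all hypothetical and chosen only to reproduce a value of the same size.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature
    fires at all; the dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample, not data from the dashboard:
# 27 active positions out of 100,000 tokens.
sample = [1.5] * 27 + [0.0] * 99_973
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.027%
```

Under that reading, a density of 0.027% would mean the feature is active on roughly 27 of every 100,000 tokens, which fits a feature tied to a narrow topic.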