Explanations
terms related to censorship and blocking
terms related to censorship and its implications
Negative Logits
ilater    -0.76
itness    -0.74
verty     -0.73
ndra      -0.73
amac      -0.72
docker    -0.71
swick     -0.70
ptoms     -0.67
ancial    -0.67
ammad     -0.66
Positive Logits
cens          0.91
censorship    0.85
censor        0.77
censored      0.76
zers          0.74
jing          0.72
orious        0.69
levied        0.68
monkey        0.64
cens          0.64
Activations Density 0.036%