INDEX
Explanations
words related to restriction, censorship, or prohibition
terms related to blocking or censorship
Negative Logits
Token    Logit
ller     -0.73
lli      -0.73
brates   -0.72
brate    -0.72
llers    -0.70
gow      -0.70
rious    -0.69
ivil     -0.66
ria      -0.66
EMBER    -0.66
Positive Logits
Token     Logit
blocking  0.95
buster    0.90
busters   0.88
listed    0.80
blockers  0.80
chains    0.78
quote     0.78
aded      0.78
lights    0.78
ades      0.77
Activations Density: 0.019%