Explanations
vocabulary related to censorship or filtering
terms related to censorship and its implications
Negative Logits
Scotia       -0.64
ordinary     -0.63
amaz         -0.60
holding      -0.59
sheet        -0.59
zzi          -0.58
Tinker       -0.58
addafi       -0.57
agne         -0.56
McAuliffe    -0.55
Positive Logits
chen         0.98
orship       0.97
ource        0.96
manship      0.94
hift         0.92
terday       0.91
urable       0.90
wear         0.88
CRIP         0.85
haw          0.85
Activations Density 0.050%