Explanations
terms associated with various forms of censorship and control
Negative Logits
Token      Logit
↵          -0.22
ialis      -0.21
OrCreate   -0.19
coming     -0.18
↵ ↵        -0.18
orsi       -0.18
sv         -0.17
ERSHEY     -0.17
↵ ↵        -0.17
ologne     -0.17
Positive Logits
Token      Logit
wealth     0.19
ifornia    0.18
pillar     0.18
stalk      0.17
punk       0.16
members    0.16
=C         0.15
enne       0.15
vast       0.15
agne       0.15
Activations Density: 1.154%