Explanations
concepts related to free speech and censorship
Negative Logits
Token        Logit
double       -0.15
ierge        -0.14
оÑĩ          -0.14
tele         -0.14
.transfer    -0.14
Compound     -0.14
ouz          -0.14
/tasks       -0.14
_IPV         -0.13
ozo          -0.13
Positive Logits
Token        Logit
prov         0.17
Freedom      0.16
stap         0.16
åĻ           0.16
freedom      0.16
xon          0.15
Freedom      0.15
inspace      0.15
shutting     0.14
censorship   0.14
Activations Density: 0.135%
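
The page does not say how the density figure is computed. A minimal sketch, assuming it means the percentage of sampled tokens on which the feature's activation is above zero, might look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires at all; the source page does not define it.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 27 of 20,000 sampled tokens.
sample = [1.0] * 27 + [0.0] * (20_000 - 27)
print(f"{activation_density(sample):.3f}%")  # prints 0.135%
```

Under this reading, a density of 0.135% means the feature is active on roughly one token in every 740.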