INDEX
Explanations
censorship
This neuron primarily detects mentions of content restrictions, such as “censorship,” “filtering,” and related moderation terms.
Negative Logits
Buscar        -0.06
VIP           -0.06
fract         -0.06
Cs            -0.06
Proxy         -0.06
�             -0.06
playlists     -0.06
悟            -0.06
Ils           -0.05
(Mock         -0.05
Positive Logits
�             0.07
=path         0.07
POSSIBILITY   0.06
oined         0.06
wonderfully   0.06
Sofa          0.06
_problem      0.06
$template     0.06
RTVF          0.06
нитель        0.06
Activations Density: 0.003%
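
To make the density figure concrete, here is a minimal sketch of how such a value could be computed. It assumes that density means the percentage of sampled tokens on which the neuron's activation is above zero; the page does not state the exact definition, so the function name `activation_density`, the threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    neuron fires at all. The dashboard's own definition may differ.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the neuron fires on 3 tokens out of 100,000.
acts = [0.0] * 100_000
for i in (10, 500, 99_999):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3f}%")  # prints 0.003%
```

Under this reading, a density of 0.003% would correspond to roughly 3 active tokens per 100,000 sampled, which is consistent with a neuron that responds only to a narrow topic.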