Explanations
policing
The neuron activates on phrases referring to “moral policing” or related content‐policy warnings.
Negative Logits
kapsamında    -0.07
κου           -0.07
Summer        -0.07
бюдж          -0.06
Rut           -0.06
Nights        -0.06
burning       -0.06
uncertainty   -0.06
conspic       -0.06
желуд         -0.06
Positive Logits
policing      0.09
responsibly   0.09
grass         0.07
폐            0.06
فارس          0.06
ViewChild     0.06
ragments      0.06
resco         0.06
babys         0.06
داد           0.06
Activations Density 0.001%