Explanations
This neuron activates on content that triggers the model’s refusal policy: disallowed or unethical requests, and the words used to decline them (e.g. “cannot,” “recommend,” “illegal,” “assist”).
Negative Logits
Token           Value
JSImport        -0.06
actual          -0.06
راست            -0.06
(not shown)     -0.06
.findElement    -0.06
(not shown)     -0.06
linger          -0.06
кими            -0.06
τι              -0.06
獲              -0.06
Positive Logits
Token             Value
и                 0.07
.GetText          0.07
rallying          0.06
Predictor         0.06
PropertyChanged   0.06
launch            0.06
=None             0.06
inter             0.06
\C                0.06
Txt               0.06
Activations Density 0.059%
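
The density figure can be read as the share of sampled tokens on which the neuron fires. The sketch below illustrates that reading only; the source does not define the metric. The function name `activation_density`, the threshold of 0, and the sample values are all assumptions made for illustration.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of tokens on which the neuron
    fires, expressed as a percentage. The source page does not spell this out.
    """
    if not activations:
        raise ValueError("activations must not be empty")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Hypothetical data: 10,000 tokens, 6 of which activate the neuron.
sample = [0.0] * 9994 + [0.8, 1.2, 0.3, 2.1, 0.5, 0.9]
density = activation_density(sample)
print(f"Activations Density {density:.3f}%")
```

Under this reading, a density of 0.059% would mean the neuron fires on roughly 6 out of every 10,000 tokens.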