INDEX
Explanations
rules and restrictions
The neuron fires on tokens in content-policy refusals and apology-style refusal statements such as “I’m sorry” or “I cannot fulfill this request”.
Negative Logits
layui      -0.07
ration     -0.07
अध         -0.07
Curso      -0.07
lần        -0.07
uncture    -0.07
/gr        -0.07
_prime     -0.06
θέση       -0.06
مبت        -0.06
Positive Logits
ank        0.07
massa      0.06
selfie     0.06
UMAN       0.06
myModal    0.06
mor        0.06
Pussy      0.06
arlar      0.05
dataTable  0.05
mean       0.05
Activations Density 0.003%