Explanations
Safety-focused refusals that empathetically redirect away from harmful or inappropriate requests, offering supportive guidance and crisis resources instead of compliance.
Negative Logits
Token            Value
buffs            0.92
industrialists   0.86
galera           0.84
merchants        0.81
suka             0.80
big              0.80
amateurs         0.80
stal             0.79
popul            0.78
consumers        0.77
Positive Logits
Token            Value
hopelessness     1.11
psychotherapy    1.08
compassion       0.97
trauma           0.97
compassionate    0.96
counseling       0.93
PTSD             0.93
grieve           0.93
emotionally      0.93
healing          0.92
Activations Density: 2.784%