INDEX
Explanations
the model's refusal to generate harmful, unethical, or inappropriate content.
Negative Logits
gives         0.58
provides      0.58
denotes       0.57
merupakan     0.55
adalah        0.54
。            0.53
Provides      0.53
constitutes   0.52
generates     0.52
requires      0.52
Positive Logits
However       1.05
But           0.96
Therefore     0.95
Consequently  0.84
但是          0.84
BUT           0.83
therefore     0.81
Furthermore   0.80
Moreover      0.79
However       0.78
Activations Density 1.199%