INDEX
Explanations
refusing harmful content requests
Negative Logits
Token         Value
meestal       1.08
biasanya      0.99
Usually       0.97
Usually       0.93
usually       0.92
Biasanya      0.91
usually       0.90
selalu        0.89
uguale        0.88
suelen        0.87
Positive Logits
Token         Value
represents    2.34
represent     2.11
represents    2.04
Represents    1.85
rappresenta   1.79
raises        1.79
representa    1.76
constitutes   1.69
représente    1.64
Represent     1.63
Activations Density: 0.652%