INDEX
Explanations
references to feelings of dependence or addiction related to substances
Negative Logits
Token               Logit
Билгалдахарш        -1.02
tartalomajánló      -0.93
Majefty             -0.89
Personensuche       -0.88
دانشنامهٔ           -0.88
Autoritní           -0.87
MigrationBuilder    -0.87
$_"                 -0.85
myſelf              -0.85
ProtoMessage        -0.85
Positive Logits
Token               Logit
[toxicity=0]        1.02
<                   1.00
Q                   0.88
Q                   0.88
[                   0.86
(no visible token)  0.83
<                   0.82
(no visible token)  0.79
[                   0.78
↵                   0.73
Activations Density 0.753%