INDEX
Explanations
negative statements or expressions of doubt
Negative Logits
Token             Logit
propOrder         -0.76
autorytatywna     -0.67
AssemblyCulture   -0.65
EDEFAULT          -0.64
ArrowToggle       -0.61
хьтан             -0.57
ffilmiau          -0.57
aarrggbb          -0.56
UIControlState    -0.56
Italijani         -0.56
Positive Logits
Token             Logit
[toxicity=0]      0.82
Q                 0.66
Q                 0.49
<                 0.48
</blockquote>     0.47
[                 0.46
Hope              0.45
<strong>          0.45
(blank)           0.44
toxicity          0.44
Activations Density 1.125%
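
The density figure above is not defined on this page. A minimal sketch of one common reading, assuming it means the percentage of sampled tokens on which the feature's activation is above zero: the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature fires.
    The dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return 100.0 * fired / len(activations)


# Hypothetical sample: 800 tokens, 9 of them with a nonzero activation.
sample = [0.0] * 791 + [0.5] * 9
print(f"{activation_density(sample):.3f}%")  # prints "1.125%"
```

Under this reading, a density of 1.125% would mean the feature fires on roughly 1 in 89 tokens, which is consistent with a fairly sparse feature.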