Explanations
references to various forms of toxicity
Negative Logits
Token             Logit
ArgumentParser    -0.97
Paglinawan        -0.96
iſt               -0.90
CppMethod         -0.90
Билгалдахарш      -0.89
berdayakan        -0.89
webElementXpaths  -0.88
Identyfik         -0.87
(not rendered)    -0.87
tagHelperRunner   -0.86
Positive Logits
Token             Logit
[toxicity=0]      3.49
↵↵                1.14
↵                 0.81
</h6>             0.80
]                 0.79
toxicity          0.70
.                 0.69
//                0.67
<eos>             0.66
)                 0.66
Activations Density 0.052%