INDEX
Explanations
harmful or exploitative content
Negative Logits
(       0.79
<h1>    0.75
Table   0.72
Draw    0.70
—       0.70
(       0.69
See     0.67
---     0.67
View    0.65
#       0.64
Positive Logits
LEC             0.88
Oversight       0.87
incapacity      0.86
<unused1888>    0.86
<unused368>     0.85
مذہبی           0.84
russe           0.83
<unused1044>    0.83
<unused2145>    0.83
hating          0.83
Activations Density: 0.350%