INDEX
Explanations
elements related to environmental and sustainability discussions
Negative Logits
LEncoder        -0.80
ambilan         -0.79
يتيمه           -0.76
Italijanski     -0.75
دانشنامهٔ       -0.75
alphabetical    -0.73
aarrggbb        -0.73
$_"             -0.72
downvotes       -0.72
UnsafeEnabled   -0.72
Positive Logits
[toxicity=0]    0.86
<bos>           0.79
↵               0.79
\               0.63
↵↵              0.59
↵↵↵             0.59
">)</           0.57
</tr>           0.52
"               0.52
)               0.52
Activations Density 0.050%