Explanations
comments or annotations in code documentation
Negative Logits
OGND          -0.81
zoude         -0.81
adeloupe      -0.78
genodigd      -0.77
mpagne        -0.76
UserScript    -0.76
stiefe        -0.75
laſſen        -0.74
ब्रेकडाउन      -0.73
للمعارف       -0.72
Positive Logits
//                    0.91
///                   0.91
*                     0.86
*                     0.81
//                    0.64
[toxicity=0]          0.64
//                    0.63
#                     0.56
(no visible token)    0.55
+//                   0.55
Activations Density 0.092%