INDEX
Explanations
negations or expressions of disagreement
Negative Logits
Token        Logit
SEDS         -0.72
Roskov       -0.64
词           -0.63
الرياضيه     -0.62
Hentet       -0.61
Parac        -0.58
IGENCE       -0.57
']}          -0.56
."],         -0.56
()].         -0.56
Positive Logits
Token        Logit
no           0.63
Nope         0.63
nope         0.61
No           0.61
NO           0.61
Nope         0.59
🙅           0.59
nope         0.59
فريبيس       0.58
No           0.57
Activations Density: 0.061%