Explanations
phrases indicating exceptions, contradictions, or nuanced arguments
Negative Logits
ibur    -0.15
ãi      -0.13
597     -0.13
foy     -0.13
sẵn     -0.13
ذ       -0.13
तम      -0.13
stup    -0.12
ipop    -0.12
æ´ĭ     -0.12
Positive Logits
true    0.68
true    0.55
TRUE    0.47
True    0.46
True    0.41
TRUE    0.41
.true   0.40
true    0.40
untrue  0.38
(true   0.37
Activations Density: 0.106%
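
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed. It assumes density means the share of token positions where the feature's activation is above zero; the page itself does not define it. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumption: "density" is the fraction of token positions on which the
    feature fires at all, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: the feature fires on 1 of 1000 token positions.
sample = [0.0] * 999 + [2.5]
print(f"{activation_density(sample):.3f}%")  # prints 0.100%
```

Under this reading, a density of 0.106% would correspond to the feature firing on roughly one token in every thousand.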