Explanations
trick question
Negative Logits
is    0.90
it    0.84
ле    0.79
ال    0.77
να    0.75
ра    0.71
س    0.70
ad    0.67
бо    0.67
ме    0.67
Positive Logits
’            0.91
-            0.83
↵            0.81
écart        0.77
jeopard      0.76
welches      0.68
catheters    0.66
DI           0.64
adı          0.63
cheating     0.63
Activations Density: 1.992%