INDEX
Explanations
any of decisions or harmful
Negative Logits
Token     Logit
Some      0.60
It        0.60
I         0.59
íss       0.59
'।        0.56
’         0.56
Needed    0.55
。        0.55
SOME      0.55
’।        0.54
Positive Logits
Token     Logit
ס         0.70
ור        0.66
든지      0.64
THING     0.61
of        0.57
новую     0.55
ד         0.55
parecido  0.53
of        0.52
were      0.52
Activations Density: 0.067%