Explanations
mentioning specific details
Negative Logits
Determine     0.66
Demonstrated  0.58
Understanding 0.55
demonstrated  0.55
Determining   0.52
för           0.51
Defender      0.51
𝟎             0.51
для           0.50
۔             0.50
Positive Logits
t             0.79
erwäh         0.72
erwähnt       0.69
mention       0.67
Mention       0.66
提到的        0.66
К             0.63
mention       0.61
m             0.60
mencion       0.60
Activations Density: 0.046%