INDEX
Explanations
comparisons and contrasts within contexts or situations
Negative Logits
Token     Logit
orre      -0.17
kontro    -0.15
ее        -0.15
/^(       -0.15
ntag      -0.14
ocate     -0.14
ader      -0.14
plat      -0.14
anded     -0.14
ADER      -0.14
Positive Logits
Token                       Logit
same                        0.30
相同 (Chinese: "identical")   0.28
same                        0.28
identical                   0.28
Same                        0.26
Same                        0.25
unchanged                   0.24
similar                     0.22
SAME                        0.22
一样 (Chinese: "the same")    0.21
Activations Density 0.218%