INDEX
Explanations
comparative phrases that assess superiority or inferiority
Negative Logits
Prev        -0.15
iden        -0.15
ker         -0.15
混合        -0.14
_PICTURE    -0.14
Rubio       -0.14
illus       -0.14
urga        -0.14
stagram     -0.14
不了        -0.14
Positive Logits
original     0.22
originals    0.20
original     0.19
direct       0.19
ORIGINAL     0.18
原始         0.18
direct       0.18
ориг         0.17
直接         0.17
straight     0.17
Activations Density 0.005%