INDEX
Explanations
phrases that express comparison or similarity
Negative Logits
iman    -0.15
堂      -0.15
aktu    -0.15
_SAFE   -0.14
obia    -0.14
iah     -0.14
函      -0.14
either  -0.13
ter     -0.13
nt      -0.13
Positive Logits
many        0.27
other       0.23
any         0.23
most        0.22
many        0.21
许多        0.21
elsewhere   0.21
其他        0.17
任何        0.17
everywhere  0.17
Activations Density 0.051%