Explanations
phrases related to identity and classification
Negative Logits
之一 ("one of")    -0.17
guy              -0.17
staffer          -0.17
.libs            -0.17
gangs            -0.16
ista             -0.15
Spells           -0.15
的一个 ("a")        -0.15
newcomer         -0.15
341              -0.15
Positive Logits
themselves  0.40
condu       0.19
ones        0.18
stew        0.18
yourselves  0.18
masters     0.18
initi       0.18
asters      0.17
holders     0.17
catalyst    0.17
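The page does not say how these logit values were computed. A common approach for feature dashboards like this one, assumed here, is to multiply the feature's decoder direction by the model's unembedding matrix and read off the largest and smallest entries. The sketch below uses toy data; `top_logit_effects`, `decoder_dir`, and `W_U` are hypothetical names, not a specific library's API.

```python
import numpy as np

def top_logit_effects(decoder_dir, W_U, vocab, k=10):
    """Return the k tokens whose logits a feature direction raises most
    (positive) and lowers most (negative), with their scores."""
    # Assumption: one scalar per vocabulary entry, obtained by projecting the
    # feature's decoder direction through the unembedding (no final LayerNorm).
    effects = decoder_dir @ W_U                  # shape: (vocab_size,)
    order = np.argsort(effects, kind="stable")   # ascending; ties keep index order
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return positive, negative

# Toy example: a 2-dimensional model with a 4-token vocabulary (hypothetical data).
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
W_U = np.array([[0.40, -0.17, 0.18, -0.05],
                [0.10,  0.20, -0.30,  0.00]])
decoder_dir = np.array([1.0, 0.0])
positive, negative = top_logit_effects(decoder_dir, W_U, vocab, k=2)
print(positive)  # highest-scoring tokens first
print(negative)  # lowest-scoring tokens first
```

Sorting once and slicing from both ends yields the two lists from a single pass over the vocabulary scores.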
Activation Density: 0.621%
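Activation density is read here as the share of sampled tokens on which the feature fires (activation above zero). That definition is an assumption, since the page states neither the threshold nor the sample size. A minimal sketch, with `activation_density` a hypothetical helper:

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`."""
    acts = np.asarray(activations, dtype=float)
    return np.count_nonzero(acts > threshold) / acts.size

# Hypothetical sample: 100,000 tokens, feature active on 621 of them.
acts = np.zeros(100_000)
acts[:621] = 1.0
density = activation_density(acts)
print(f"{density:.3%}")  # 0.621%
```

Under this reading, a density of 0.621% means the feature fires on roughly 621 of every 100,000 tokens.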