Explanations
phrases expressing negation or the concept of exclusivity
Negative Logits
sts      -0.17
illon    -0.16
inel     -0.15
lems     -0.15
atory    -0.14
bane     -0.14
jom      -0.14
ogan     -0.14
hub      -0.14
annes    -0.14
Positive Logits
alone    0.49
Alone    0.41
alone    0.37
-alone   0.33
seule    0.26
sole     0.26
seul     0.24
å͝ä¸Ģ    0.24
solo     0.24
lone     0.23
Activations Density: 0.045%