INDEX
Explanations
phrases emphasizing mutual support and connection among individuals
Negative Logits
uci      -0.78
0002     -0.70
lam      -0.64
hift     -0.63
reimb    -0.63
ariat    -0.61
°        -0.61
alty     -0.60
nit      -0.60
DK       -0.60
Positive Logits
selves          0.92
worldly         0.83
individually    0.80
self            0.74
equally         0.71
heric           0.70
offensively     0.68
龍喚士          0.68
anguages        0.68
mutually        0.67
Activations Density 0.009%