Explanations
emotional and nostalgic language related to personal memories and experiences
Negative Logits
merit          -0.73
attest         -0.67
preference     -0.67
opath          -0.66
complement     -0.63
fide           -0.63
scapego        -0.63
depletion      -0.63
replacement    -0.63
privilege      -0.62
Positive Logits
BUT            1.11
yet            1.06
until          1.05
why            1.05
oops           1.04
unless         0.97
but            0.96
except         0.96
they           0.94
that           0.93
Activations Density: 0.042%