Explanations
information related to personal biographies or life events
Negative Logits
ongyang        -0.82
ramid          -0.79
ength          -0.71
goodness       -0.67
oning          -0.64
erity          -0.63
iology         -0.63
idth           -0.62
inately        -0.62
inarily        -0.61
Positive Logits
acquainted      1.10
accustomed      1.09
entangled       1.03
embroiled       1.03
extinct         1.01
aware           1.00
addicted        0.91
obsessed        0.90
disillusion     0.87
eligible        0.87
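The page does not say how these two lists were produced. A common approach for feature dashboards, assumed here and not confirmed by this page, is to project the feature's decoder direction through the model's unembedding matrix and read off the tokens with the largest (promoted) and smallest (suppressed) values. The sketch below shows that ranking step on a toy 3-dimensional example; the vectors and the function names `logit_effects` and `top_and_bottom` are hypothetical, and details such as the final layer norm are ignored.

```python
from typing import Dict, List, Tuple

def logit_effects(decoder_dir: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Dot product of the feature's decoder direction with each token's unembedding vector.

    Assumption: logit effect = decoder_dir . W_U[:, token], with no layer norm applied.
    """
    return {tok: sum(d * u for d, u in zip(decoder_dir, col)) for tok, col in unembed.items()}

def top_and_bottom(effects: Dict[str, float], k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most promoted and the k most suppressed tokens."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return positive, negative

# Toy example with hypothetical numbers, not real model weights.
decoder_dir = [1.0, 0.5, -0.5]
unembed = {
    "acquainted": [0.8, 0.4, -0.2],
    "aware": [0.6, 0.2, -0.4],
    "ength": [-0.5, -0.3, 0.2],
    "idth": [-0.2, -0.4, 0.1],
}
positive, negative = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print(positive)
print(negative)
```

In a real model the same ranking would run over the full vocabulary, which is why the lists above are capped at ten entries each.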
Activation Density: 0.055%
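An activation density of 0.055% corresponds to roughly 55 firing positions per 100,000 tokens. Assuming density means the fraction of token positions where the feature's activation is above zero (a usual convention, not stated on this page), it can be computed as in this sketch; the `activation_density` helper and the sample activations are hypothetical.

```python
from typing import List

def activation_density(activations: List[float], threshold: float = 0.0) -> float:
    """Fraction of token positions where the feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)

# Hypothetical sample: the feature fires at 55 of 100,000 token positions.
acts = [0.0] * 100_000
for i in range(0, 55 * 1000, 1000):
    acts[i] = 2.3  # arbitrary positive activation value

density = activation_density(acts)
print(f"{density:.3%}")  # 0.055%
```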