Explanations
references to loved ones and familial affection
Negative Logits
(not shown)  -0.79
l            -0.70
y            -0.64
1            -0.62
i            -0.62
3            -0.61
McClure      -0.61
Tor          -0.61
Kondo        -0.60
口           -0.60
Positive Logits
loved        1.87
Loved        1.84
Loved        1.63
LOVED        1.62
loved        1.50
liked        1.22
gelieb       1.21
ſelves       1.19
uſed         1.19
pleaſure     1.15
Activations Density 0.104%