INDEX
Explanations
self-referential phrases and discussions about personal identity
Negative Logits
霞       -0.14
ivent    -0.14
oji      -0.14
veau     -0.14
lix      -0.14
TMPro    -0.14
hwnd     -0.13
á»       -0.13
δά       -0.13
llib     -0.13
Positive Logits
love     0.41
LOVE     0.36
likes    0.33
love     0.33
loves    0.32
prefer   0.31
likes    0.30
tend     0.29
Love     0.28
喜欢     0.28
Activations Density 0.800%