Explanations
references to the concept of love and affection
Negative Logits
ιαÏĤ     -0.15
urable   -0.15
iff      -0.15
loo      -0.14
averse   -0.14
PCM      -0.13
urch     -0.13
htub     -0.13
avors    -0.13
avers    -0.13
Positive Logits
eliness  0.21
ania     0.19
renc     0.19
alker    0.17
Letter   0.16
esome    0.16
/right   0.15
åĦª      0.15
ardy     0.15
-fi      0.15
Activations Density 0.009%