Explanations
expressions of love and affection
Negative Logits
Conſ          -0.64
ء             -0.62
himſelf       -0.61
Nuovo         -0.60
Unnamed       -0.59
syscall       -0.59
Dons          -0.59
Majefty       -0.58
ſame          -0.57
themſelves    -0.57
Positive Logits
love          0.84
ValueStyle    0.82
dislike       0.78
hate          0.77
loves         0.77
loved         0.74
senang        0.74
liked         0.74
hated         0.72
hates         0.72
Activations Density: 0.103%