INDEX
Explanations
expressions of preference or affection
Negative Logits
TestingModule  -0.74
Asher          -0.73
Hernandez      -0.72
genoux         -0.71
Carrasco       -0.70
Berman         -0.70
Pennington     -0.68
brazos         -0.67
шель           -0.67
se             -0.67
Positive Logits
dislike  0.88
liked    0.85
Likes    0.83
Likes    0.82
Liked    0.80
Lik      0.79
liked    0.73
gusta    0.72
👍👍     0.72
dislike  0.71
Activations Density: 0.047%