Explanations
expressions of liking or affection
Negative Logits
Asher         -0.75
èvre          -0.72
Hernandez     -0.71
Pennington    -0.68
Berman        -0.68
Crowe         -0.67
<h6>          -0.65
шель          -0.63
codegen       -0.63
din           -0.62
Positive Logits
liked         0.96
Liked         0.92
dislike       0.88
Likes         0.86
Likes         0.84
gusta         0.84
Lik           0.81
dislike       0.81
liking        0.78
👍👍          0.77
Activations Density: 0.049%