INDEX
Explanations
expressions related to preferences and opinions
Negative Logits
клопе              -0.58
xodo               -0.54
Righteous          -0.51
transQ             -0.50
躇                 -0.49
(no token shown)   -0.47
AFA                -0.47
audiovisuel        -0.46
uuidv              -0.46
mourut             -0.46
Positive Logits
liked              2.00
loved              1.78
liking             1.76
love               1.72
loves              1.64
likes              1.56
enjoyed            1.54
liked              1.51
enjoy              1.47
LOVED              1.46
Activations Density 0.309%
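
The density figure above is commonly read as the fraction of sampled token positions on which the feature fires. The following is a minimal sketch of that calculation under that assumption; the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from this dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (positions that fire) / (positions sampled).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 1,000 token positions, 3 of which have a nonzero activation.
sample = [0.0] * 997 + [0.8, 1.3, 2.1]
print(f"{activation_density(sample):.3%}")  # prints 0.300%
```

Under this reading, a density of 0.309% would correspond to the feature firing on roughly 3 of every 1,000 sampled tokens.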