Explanations
phrases expressing positivity or admiration
positive sentiments related to being able to engage with others and participate joyfully
Negative Logits
olicy       -0.74
uria        -0.73
urred       -0.71
ourse       -0.69
mage        -0.69
icipated    -0.68
cum         -0.67
è£          -0.67
é¾          -0.65
heid        -0.64
Positive Logits
tid         0.80
noon        0.70
ya          0.68
remind      0.67
symmetry    0.66
🙂          0.66
buddy       0.66
knowing     0.66
congr       0.66
reminder    0.66
Activations Density 0.188%