Explanations
expressions of gratitude and encouragement
Negative Logits
Token      Logit
↵          -0.18
Fucking    -0.17
FUCK       -0.17
Fuck       -0.16
fucks      -0.15
Fuck       -0.15
fuck       -0.15
·          -0.15
fucking    -0.15
:↵         -0.14
Positive Logits
Token       Logit
glad        0.20
indeed      0.20
Indeed      0.18
Indeed      0.17
appreciate  0.17
yes         0.17
agree       0.17
Glad        0.17
inde        0.16
apprec      0.16
Activations Density 0.079%
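
The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of sampled token positions on which the feature's activation is above zero. The function name, the threshold, and the sample data below are all hypothetical and only illustrate that arithmetic; they are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled positions where the
    feature fires at all; the source page does not define it.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 8 of them with nonzero activation.
sample = [0.0] * 9992 + [1.3, 0.4, 2.1, 0.7, 0.2, 0.9, 1.1, 0.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.079% would mean the feature is active on roughly 8 out of every 10,000 sampled tokens.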