INDEX
Explanations
phrases expressing enjoyment or positive experiences
Negative Logits
rage        -0.06
AKE         -0.06
/scripts    -0.06
Lage        -0.06
shr         -0.06
lassen      -0.06
wit         -0.05
gger        -0.05
wa          -0.05
lay         -0.05
Positive Logits
BOSE        0.08
oriously    0.07
itesse      0.07
stras       0.07
_barrier    0.06
oenix       0.06
elines      0.06
eus         0.06
.sel        0.06
/sn         0.06
Activations Density: 0.007%
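
The page does not define how the density figure is computed. A minimal sketch, assuming it is the percentage of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and used only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all; the source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 7 active positions out of 100,000 sampled tokens.
sample = [0.0] * 99_993 + [1.2] * 7
print(f"{activation_density(sample):.3f}%")
```

Under this reading, a density of 0.007% would correspond to roughly 7 active positions per 100,000 tokens, which marks a very sparse feature.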