Explanations
words related to excitement or enjoyment
Negative Logits
Token   Logit
gra     -0.17
yyy     -0.17
olis    -0.17
tt      -0.15
yyyy    -0.15
y       -0.15
sur     -0.14
ytt     -0.14
veral   -0.14
sol     -0.14
Positive Logits
Token   Logit
ey      0.23
igans   0.20
chie    0.20
ze      0.18
peed    0.18
iple    0.17
zers    0.17
porno   0.17
zy      0.16
ie      0.16
Activations Density: 0.016%