Explanations
words related to psychological states or emotions
terms related to themes of experimentation and exploitation
Negative Logits
Token     Logit
hips      -0.93
heny      -0.89
ento      -0.84
nings     -0.83
itu       -0.83
sen       -0.81
ingham    -0.77
iating    -0.77
IFE       -0.76
yss       -0.76
Positive Logits
Token     Logit
grab      0.79
ploy      0.77
cipher    0.75
hawk      0.73
whore     0.73
brake     0.73
bunny     0.73
glove     0.72
hatch     0.72
porn      0.72
Activations Density: 0.301%