Explanations
the word "like" followed by different scenarios or preferences expressed in the text
expressions of dislike or aversion
Negative Logits
iggins    -0.84
rontal    -0.83
utical    -0.81
yrinth    -0.81
itech     -0.79
owa       -0.76
nown      -0.75
verty     -0.74
oided     -0.73
olon      -0.73
Positive Logits
anymore       0.96
anybody       0.88
nor           0.87
anything      0.80
anyone        0.79
any           0.76
ably          0.75
surprises     0.71
censorship    0.67
bullies       0.66
Activations Density 0.048%