INDEX
Explanations
phrases expressing personal preferences and likes
expressions of preference or liking
Negative Logits
voy          -0.70
empt         -0.68
ueless       -0.66
Args         -0.65
SPONSORED    -0.63
rogens       -0.62
emer         -0.62
AIDS         -0.61
idal         -0.61
eding        -0.61
Positive Logits
76561         0.86
myself        0.80
lihood        0.79
dearly        0.76
poke          0.73
compliments   0.73
seeing        0.72
fully         0.72
Fine          0.71
66666666      0.71
Activations Density: 0.093%