Explanations
phrases indicating preference or habitual choices
Negative Logits
Opport       -0.17
imbus        -0.16
curring      -0.15
omik         -0.15
pls          -0.14
Geg          -0.14
Plzeň        -0.14
267          -0.14
onian        -0.14
istically    -0.14
Positive Logits
-to          0.28
go           0.27
-go          0.25
-To          0.23
(go          0.19
thic         0.18
go           0.18
oose         0.17
Go           0.17
.go          0.16
Activations Density 0.031%