Explanations
words and phrases related to preferences and choices
Negative Logits
angelo    -0.18
quee      -0.16
romo      -0.16
urgy      -0.15
inea      -0.15
spiel     -0.15
790       -0.15
elsing    -0.14
rapy      -0.14
suming    -0.14
Positive Logits
entially  0.39
ential    0.36
ably      0.22
encing    0.18
renc      0.18
prefer    0.17
lag       0.17
ensi      0.17
idian     0.17
enced     0.16
Activations Density: 0.027%