Explanations
phrases expressing personal preferences or enjoyment
expressions of preference or liking
Negative Logits
arta      -0.80
tek       -0.72
irrel     -0.71
krit      -0.70
cession   -0.69
ilion     -0.69
inary     -0.68
Login     -0.68
INAL      -0.64
PATH      -0.64
Positive Logits
seeing          1.00
experimenting   0.91
watching        0.91
surprises       0.88
interacting     0.87
hearing         0.87
having          0.86
to              0.86
talking         0.79
ably            0.78
Activations Density: 0.075%