INDEX
Explanations
words indicating perception or appearance
Negative Logits
Token        Logit
BOOLE        -0.15
*dt          -0.14
aben         -0.14
taire        -0.13
andes        -0.13
anela        -0.13
etten        -0.13
.Suppress    -0.13
abant        -0.13
theValue     -0.13

Positive Logits
Token        Logit
like         0.52
Like         0.47
Like         0.41
like         0.39
LIKE         0.36
_like        0.34
.like        0.33
likes        0.30
wie          0.30
como         0.29
Activations Density: 0.010%