INDEX
Explanations
words related to strong positive emotions like joy, delight, and pleasure
expressions and references to joy and pleasure
Negative Logits
Bans       -0.68
Canary     -0.65
braces     -0.64
heed       -0.61
pta        -0.60
pat        -0.60
adamant    -0.60
strict     -0.59
inx        -0.58
arin       -0.58
Positive Logits
ride       1.22
ously      1.11
fully      1.10
ous        1.08
sticks     1.02
urable     0.97
iously     0.93
joy        0.92
fulness    0.88
urous      0.88
Activations Density 0.091%
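
The density figure above is commonly read as the fraction of sampled tokens on which the feature fires at all. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the toy data are assumptions for illustration, not the dashboard's actual computation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (above-threshold) activation. The source page does not define it.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 tokens, 9 of which activate the feature.
toy_activations = [0.0] * 9991 + [1.5] * 9
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.091% would mean the feature is active on roughly 9 of every 10,000 tokens, which fits a feature tied to a narrow family of joy and pleasure words.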