INDEX
Explanations
expressions indicating perception or observation
Negative Logits
supposedly    -0.19
esian         -0.15
itol          -0.15
uner          -0.14
readcr        -0.14
presumably    -0.14
quets         -0.14
quet          -0.14
ialis         -0.13
aspiring      -0.13
Positive Logits
ingly         0.29
lessly        0.28
like          0.28
intent        0.21
likes         0.19
intent        0.18
Like          0.17
like          0.17
Like          0.17
liked         0.17
Activations Density 0.040%