INDEX
Explanations
behaviors and actions, particularly in social contexts
Negative Logits
Easily    -0.18
easily    -0.18
erif      -0.16
imple     -0.15
ød        -0.14
bia       -0.14
gon       -0.14
onte      -0.14
aint      -0.14
pts       -0.14
Positive Logits
differently   0.31
like          0.30
according     0.22
_like         0.21
Like          0.21
LIKE          0.20
наче          0.20
contrary      0.20
Like          0.20
manner        0.19
Activations Density 0.050%
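
The density figure above is read here as the fraction of sampled tokens on which the feature has a nonzero activation. That reading is an assumption, since the page does not define the term. Under it, a minimal sketch of the arithmetic follows; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to reproduce a value of 0.050%.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means (tokens where the feature fires) / (all tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 5 activate the feature.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")
```

With these made-up numbers, 5 active tokens out of 10,000 give a density of 0.0005, which is 0.050% when formatted as a percentage.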