Explanations
behaviors related to social interactions and the notion of acting in various contexts
Negative Logits
Easily      -0.19
easily      -0.17
æĹı         -0.14
[NUM        -0.13
accur       -0.13
accuracy    -0.13
aset        -0.13
etak        -0.13
ãĥ¥         -0.13
easiest     -0.13
Positive Logits
aul         0.28
uate        0.28
like        0.27
/react      0.26
upon        0.25
ully        0.24
uated       0.21
liked       0.21
contrary    0.20
/respond    0.20
Activations Density 0.041%