INDEX
Explanations
descriptions of personality traits and social interactions
Negative Logits
Token     Logit
,         -0.32
[],       -0.24
ा,        -0.23
े,        -0.23
++,       -0.22
*,        -0.22
®,        -0.22
(),       -0.21
ों,       -0.21
+,        -0.21
Positive Logits
Token     Logit
but       0.25
and       0.19
but       0.18
sondern   0.15
which     0.14
etc       0.14
and       0.14
.and      0.14
or        0.14
które     0.14
Activations Density: 0.978%