Explanations
references to emotional states or psychological conditions
Negative Logits
Token    Logit
.        -0.32
↵        -0.27
,        -0.26
(blank)  -0.26
.↵       -0.24
,        -0.22
p        -0.22
a        -0.22
(        -0.22
:        -0.22
Positive Logits
Token    Logit
галі     0.41
лі       0.36
ві       0.35
рі       0.35
ні       0.33
ені      0.33
ті       0.33
елі      0.33
ґ        0.33
є        0.32
Activations Density 0.034%