Explanations
phrases indicating speech or expression of opinions
Negative Logits
('
-0.19
(“
-0.18
("
-0.18
(«
-0.16
(`
-0.16
которым
-0.14
owie
-0.14
.sz
-0.14
è½
-0.14
_DX
-0.13
Positive Logits
,↵
0.29
,
0.25
,"
0.23
,↵↵
0.23
,”
0.23
:
0.21
,"↵
0.21
,č↵
0.20
,↵
0.18
,↵↵↵↵
0.18
Activations Density 0.207%