INDEX
Explanations
expressions of contradiction or clarification in statements
Negative Logits
ÙĪÙĪ         -0.18
uelle        -0.16
uel          -0.15
isel         -0.15
elin         -0.14
oldt         -0.14
certainly    -0.14
ogn          -0.14
Äĥr          -0.14
misdemean    -0.14
Positive Logits
actually     0.22
actually     0.22
actual       0.21
actual       0.20
Actually     0.18
Actually     0.17
Actual       0.17
Actual       0.17
_actual      0.16
(actual      0.16
Activations Density 0.140%