INDEX
Explanations
words related to questioning and contradictions
NEGATIVE LOGITS
andal       -0.15
shouldBe    -0.15
(!!         -0.15
/tiny       -0.13
rium        -0.13
theres      -0.13
/misc       -0.13
?=.*        -0.12
ÑģеÑĢ        -0.12
ohen        -0.12
POSITIVE LOGITS
not         0.69
nicht       0.67
tidak       0.60
niet        0.59
không       0.58
नह          0.57
не          0.57
não         0.56
ikke        0.56
NOT         0.54
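Logit lists of this kind are commonly obtained by projecting a feature's output direction through the model's unembedding matrix and ranking the vocabulary by the resulting scores. The sketch below illustrates that idea only; this page does not state how its values were computed. The function name `top_logit_effects`, the two-dimensional `decoder_dir`, and the toy `unembed` matrix are all hypothetical, and the numbers are invented for illustration.

```python
def top_logit_effects(decoder_dir, unembed, vocab, k=3):
    """Rank tokens by the dot product of a feature direction with each
    token's unembedding column (hypothetical helper, not a library API)."""
    effects = []
    for j, token in enumerate(vocab):
        # Dot product of the feature direction with column j of the unembedding.
        score = sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        effects.append((token, score))
    ranked = sorted(effects, key=lambda pair: pair[1], reverse=True)
    # Most boosted tokens first, then most suppressed tokens (lowest score first).
    return ranked[:k], ranked[-k:][::-1]


# Toy data, invented for illustration: 2-dimensional model, 4-token vocabulary.
vocab = ["not", "nicht", "the", "andal"]
decoder_dir = [1.0, 0.5]
unembed = [
    [0.6, 0.5, 0.0, -0.1],
    [0.2, 0.3, 0.0, -0.1],
]

positive, negative = top_logit_effects(decoder_dir, unembed, vocab, k=2)
for token, score in positive:
    print(f"{token:10s} {score:+.2f}")
for token, score in negative:
    print(f"{token:10s} {score:+.2f}")
```

Under this reading, the positive list names the tokens the feature pushes the model toward (here, "not" in many languages) and the negative list names the tokens it pushes away from.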
Activations Density 1.354%
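An activation density figure like this is usually read as the fraction of sampled tokens on which the feature fires. A minimal sketch under that assumption follows; the helper name `activation_density`, the zero threshold, and the sample values are hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumes density means "share of sampled tokens where the feature is
    active"; the exact definition used by the dashboard is not stated.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Invented sample: 1000 token positions, 15 of them with a nonzero activation.
sample = [0.0] * 985 + [1.2] * 15
density = activation_density(sample)
print(f"Activations Density {density:.3%}")
```

A density near 1% means the feature is sparse: it is silent on the large majority of tokens and active only on a narrow slice of the data.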