Explanations
concepts related to deception and falsehoods
Negative Logits
.scalablytyped    -0.24
both              -0.15
Both              -0.14
raquo             -0.14
Both              -0.14
both              -0.14
WEEN              -0.13
BOTH              -0.13
bole              -0.12
;č↵               -0.12
Positive Logits
XYZ               0.42
xyz               0.35
XYZ               0.33
X                 0.29
xyz               0.26
æŁIJ              0.26
say               0.25
say               0.25
tomorrow          0.24
ABC               0.24
Activations Density: 0.782%