INDEX
Explanations
words related to deception and falsehoods
Negative Logits
nels              -0.17
mente             -0.16
core              -0.16
rik               -0.15
.scalablytyped    -0.15
rost              -0.15
bos               -0.15
tle               -0.15
wholly            -0.15
ially             -0.15
Positive Logits
ÌĪ                0.20
keepers           0.18
readcr            0.17
theast            0.17
xygen             0.17
yssey             0.17
ys                0.16
thing             0.16
otros             0.15
ãĤ©               0.15
Activations Density: 0.559%