INDEX
Explanations
phrases indicating a sense of complexity or contradiction
Negative Logits
ellig      -0.15
JKLMNOP    -0.15
avl        -0.14
ADX        -0.14
.dense     -0.14
argin      -0.14
ovsky      -0.14
.lst       -0.14
ermann     -0.14
inalg      -0.13
Positive Logits
Sala       0.16
nature     0.15
enough     0.14
Nature     0.14
ness       0.14
SAC        0.14
stalk      0.14
wend       0.14
sac        0.13
stalk      0.13
Activations Density: 0.284%