Explanations
assertions or claims about knowledge and truth in various contexts
Negative Logits
Token      Logit
ones       -0.17
.bz        -0.15
somehow    -0.15
odule      -0.14
ajas       -0.14
icator     -0.14
oggler     -0.14
aga        -0.14
als        -0.14
ular       -0.13
Positive Logits
Token      Logit
happening  0.22
Wrong      0.18
wrong      0.18
/loose     0.18
wrong      0.17
besides    0.16
Wrong      0.16
ÙĪÙħا      0.16
regarding  0.16
happened   0.16
Activations Density: 0.227%