Explanations
questions and statements regarding the validity of claims and truths
Negative Logits
Token        Logit
kop          -0.15
Authority    -0.14
Unified      -0.14
.bio         -0.14
мя           -0.14
ереж         -0.13
van          -0.13
.attrs       -0.13
asm          -0.13
pedia        -0.13
Positive Logits
Token        Logit
true         0.84
true         0.76
TRUE         0.66
True         0.65
True         0.61
TRUE         0.60
truth        0.54
true         0.54
(true        0.52
verdade      0.52
Activations Density 0.317%
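
The page does not say how the density figure is computed. A common reading, assumed here, is the fraction of token positions on which the feature's activation is above zero. The function name, the threshold parameter, and the sample data below are all hypothetical and serve only to illustrate that reading.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature fires.
    The dashboard itself does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 1000 token positions, 3 of which activate the feature.
sample = [0.0] * 997 + [0.4, 1.2, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.300%
```

Under this reading, a density of 0.317% would mean the feature is active on roughly 3 out of every 1000 token positions.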