Explanations
references to uncovering hidden truths or secrets
Negative Logits
åѤ      -0.15
ë´ī      -0.15
å¡ļ      -0.15
acus     -0.14
çī¹èī²   -0.14
ÎŃÏģγ    -0.14
æŀĿ      -0.13
ronic    -0.13
itten    -0.13
Chance   -0.13
Positive Logits
truth    0.64
truth    0.52
Truth    0.50
truths   0.49
secrets  0.49
true     0.48
Truth    0.45
verdad   0.41
secret   0.41
true     0.38
Activations Density 0.226%