INDEX
Explanations
words related to misconceptions, false beliefs, and deception
concepts related to illusions and delusions
Negative Logits
Token      Logit
atories    -0.72
RC         -0.69
uter       -0.65
ĵ          -0.65
iary       -0.64
utor       -0.64
ï¸ı        -0.63
RAFT       -0.62
rc         -0.62
received   -0.61
Positive Logits
Token            Logit
illusion         3.28
illusions        2.99
Illusion         2.32
delusion         2.02
illusion         1.85
delusions        1.60
mir              1.31
hallucinations   1.29
impressions      1.27
halluc           1.23
Activations Density 0.029%
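
The density figure above is most naturally read as the share of token positions on which this feature fires at all. The dashboard does not state how it computes the number, so the following is only a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (positions where the feature fires) / (all positions).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, with the feature firing on 3 of them.
sample = [0.0] * 10000
for position in (17, 4242, 9001):
    sample[position] = 1.5

density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.029% would correspond to the feature firing on roughly 29 of every 100,000 tokens, which fits a feature tied to a narrow set of words such as "illusion" and "delusion".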