INDEX
Explanations
references to deception, illusion, and misleading narratives
Negative Logits
token       logit
ensch       -0.16
alic        -0.16
WISE        -0.15
ãĤ¥         -0.15
ç±          -0.15
cess        -0.15
ØŃÙĪ        -0.14
"default    -0.14
Writes      -0.14
chamber     -0.14
Positive Logits
token       logit
rzy         0.17
urb         0.16
char        0.15
orraine     0.15
izard       0.15
illusion    0.15
ünst        0.14
itzer       0.14
chet        0.14
elden       0.14
Activations Density 0.112%
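
The density figure can be read as the share of token positions on which the feature fires. The sketch below assumes that definition (activation strictly above a threshold of zero). The function name `activation_density` and the toy data are hypothetical and are not taken from the dashboard or from any particular library.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the feature is
    active, which is a common convention for sparse-autoencoder dashboards.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (illustrative only): 10,000 token positions, 11 of them active.
toy = [0.0] * 10_000
for i in range(11):
    toy[i * 900] = 1.5

density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.110%
```

A density near 0.1%, like the 0.112% reported here, would mean the feature is active on roughly one token in a thousand under this definition.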