INDEX
Explanations
words related to deception and disguises
references to illusions and disguises
Negative Logits
Token       Logit
uled        -0.72
aldi        -0.69
orough      -0.69
Ü           -0.66
capacity    -0.65
vez         -0.64
acid        -0.63
Issues      -0.63
olved       -0.62
iaries      -0.62
Positive Logits
Token       Logit
deceive     1.10
disgu       1.00
mir         0.94
illusion    0.90
deception   0.85
disguise    0.84
camoufl     0.84
pas         0.82
querade     0.81
Illusion    0.80
Activations Density 0.070%
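
As a rough illustration of how to read the density figure, the sketch below assumes that activation density means the fraction of sampled tokens on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is (number of active tokens) / (total tokens).
    """
    if not activations:
        raise ValueError("activations must contain at least one value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 7 activate the feature.
sample = [0.0] * 9993 + [0.5] * 7
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.070%"
```

Under that assumption, a density of 0.070% would correspond to the feature firing on roughly 7 of every 10,000 tokens, which is consistent with a narrow feature tied to deception and disguise vocabulary.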