Explanations
phrases related to deception or illusion
terms related to metaphorical representations and abstract concepts
Negative Logits
reditary  -0.98
igree  -0.88
rences  -0.84
aeper  -0.83
arya  -0.78
ogun  -0.77
uled  -0.76
gments  -0.76
raviolet  -0.76
uilt  -0.75
Positive Logits
女  0.93
モ  0.80
fish  0.80
phony  0.74
ファ  0.71
gad  0.70
fig  0.68
crop  0.66
メ  0.65
————————————————  0.64
Activations Density 0.028%