INDEX
Explanations
the specific term "Ass" with varying activation strengths
variations of the term "assassin" or related terms
NEGATIVE LOGITS
çĦ         -0.72
AAF        -0.69
vation     -0.68
Tobacco    -0.64
Mercury    -0.63
WD         -0.63
Welsh      -0.63
poppy      -0.62
MET        -0.62
smoke      -0.61
POSITIVE LOGITS
assin      1.36
ociate     1.28
essor      1.27
sembly     1.21
ass        1.21
alam       1.16
oci        1.13
ortment    1.12
alon       1.09
essment    1.08
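
The page does not say how the logit values above are computed. A common approach for dashboards of this kind, assumed here, is to project the feature's decoder direction through the model's unembedding matrix and list the tokens with the largest and smallest resulting scores. The sketch below illustrates that idea on toy numbers; `logit_effects`, `top_and_bottom`, and all vectors are hypothetical and are not taken from this feature.

```python
def logit_effects(decoder_dir, unembed_rows, vocab):
    """Dot the feature direction with each token's unembedding vector.

    Assumption: one unembedding vector per vocabulary token, same length
    as decoder_dir. Returns a dict mapping token -> score.
    """
    effects = {}
    for token, row in zip(vocab, unembed_rows):
        effects[token] = sum(d * u for d, u in zip(decoder_dir, row))
    return effects


def top_and_bottom(effects, k):
    """Return the k most positive and k most negative (token, score) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    most_positive = ranked[:k]
    most_negative = ranked[-k:][::-1]  # most negative first
    return most_positive, most_negative


# Toy example (made-up numbers, not from the dashboard).
toy_vocab = ["a", "b", "c"]
toy_unembed = [[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]]
toy_direction = [1.0, 0.5]

toy_effects = logit_effects(toy_direction, toy_unembed, toy_vocab)
positive, negative = top_and_bottom(toy_effects, k=1)
print(positive, negative)
```

Under this assumption, the positive list names tokens the feature pushes the model toward when it fires, and the negative list names tokens it pushes away from.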
Activations Density 0.007%
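
The density figure is not defined on the page. A minimal sketch, assuming it means the fraction of sampled tokens on which the feature's activation is above zero; `activation_density` and the toy data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" counts a token as active when its activation
    is strictly greater than the threshold (zero by default).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy data: 7 active tokens out of 100,000 sampled tokens.
toy_activations = [0.0] * 99_993 + [1.5] * 7
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.007%
```

A density this low would mean the feature fires on only a handful of tokens per hundred thousand, which fits a feature tied to a narrow family of word fragments.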