INDEX
Explanations
phrases containing the word "attacked"
instances of the word "attacked."
Negative Logits
Vert    -0.72
val     -0.65
YC      -0.63
tz      -0.63
shown   -0.61
atom    -0.61
sa      -0.61
vert    -0.61
flu     -0.61
aver    -0.61
Positive Logits
attack      1.03
attacked    0.94
attacks     0.94
attackers   0.89
oise        0.89
attack      0.87
ritch       0.86
attacking   0.85
ivated      0.82
Attack      0.80
Activations Density: 0.015%