INDEX
Explanations
references to the 9/11 attacks
Negative Logits
Tile      -0.70
eva       -0.67
hunt      -0.64
maid      -0.63
pengu     -0.63
Jem       -0.63
laus      -0.62
Nun       -0.62
chance    -0.62
unman     -0.62
Positive Logits
Syndrome      0.84
anniversary   0.83
Truth         0.76
abad          0.76
bombers       0.74
amara         0.74
truth         0.72
iets          0.72
bombings      0.72
Truth         0.71
Activations Density 0.024%