Explanations
instances where attention is being discussed or emphasized
mentions of attention and its implications or effects
Negative Logits
Tale        -0.70
Yugoslavia  -0.65
Rebell      -0.63
Yar         -0.63
UES         -0.62
Dani        -0.62
lins        -0.61
halves      -0.60
Rouge       -0.60
Harbour     -0.60
Positive Logits
estinal        0.88
spans          0.87
span           0.86
orial          0.85
attention      0.83
ively          0.79
largeDownload  0.79
bender         0.79
seeker         0.78
seekers        0.77
Activations Density: 0.028%