INDEX
Explanations
references to attention or attentiveness
Negative Logits
лиц     -0.20
idding  -0.17
ierge   -0.17
zes     -0.17
abra    -0.16
off     -0.16
utz     -0.15
chte    -0.15
offs    -0.15
ould    -0.15
Positive Logits
itudes  0.29
orney   0.27
itude   0.26
itud    0.24
orneys  0.24
ENTION  0.24
uned    0.23
acks    0.23
acked   0.21
acking  0.21
Activations Density: 0.012%