Explanations
words and phrases that indicate attention or interest
Negative Logits
Token     Logit
odyn      -0.16
omanip    -0.16
ameda     -0.15
icas      -0.14
incinn    -0.14
-BEGIN    -0.14
озв       -0.14
iliz      -0.14
itol      -0.14
avn       -0.13
Positive Logits
Token        Logit
attention    0.65
attention    0.51
att          0.44
attent       0.43
Attention    0.42
notice       0.42
ATT          0.40
внимание     0.39
Attention    0.39
attn         0.38
Activations Density: 0.070%
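
The density figure can be read as the share of sampled token positions on which the feature fires. The sketch below illustrates that reading only; it assumes "density" means the fraction of positions with activation above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: a position counts as "active" when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 7 active positions out of 10,000.
sample = [0.0] * 9993 + [1.2] * 7
print(f"{activation_density(sample):.3%}")  # prints 0.070%
```

A density this low means the feature is sparse: under the assumption above, it is active on roughly 7 of every 10,000 positions.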