INDEX
Explanations
instances of the word "look" and its variations used to direct attention
Negative Logits
uner      -0.19
eka       -0.18
MENT      -0.17
eken      -0.17
ekt       -0.15
soever    -0.15
idor      -0.15
uctor     -0.14
/by       -0.14
ffen      -0.14
Positive Logits
closely   0.21
sharp     0.21
familiar  0.21
ma        0.20
Sharp     0.20
sharp     0.19
outs      0.19
Fam       0.19
Ma        0.18
fant      0.18
Activations Density: 0.017%