Explanations
phrases that imply manipulation or diversion of attention
Negative Logits
ntag                  -0.16
enco                  -0.16
icio                  -0.16
upstream              -0.15
dater                 -0.15
InterfaceOrientation  -0.15
odega                 -0.15
bish                  -0.15
uts                   -0.14
HOLDERS               -0.14
Positive Logits
away                  0.36
attention             0.33
toward                0.31
Away                  0.30
towards               0.29
attention             0.29
Attention             0.28
divert                0.27
diverted              0.27
onto                  0.25
Activations Density: 0.060%