INDEX
Explanations
instances of persuasion or attempts to convince others
Negative Logits
yms         -0.16
itler       -0.16
omb         -0.16
iet         -0.15
ookie       -0.15
uels        -0.15
instein     -0.14
lettes      -0.14
_PROGRAM    -0.14
ɵ           -0.14
Positive Logits
argument    0.16
316         0.15
atively     0.15
apore       0.15
args        0.15
ltk         0.15
ingly       0.14
(?)         0.14
convin      0.14
arg         0.14
Activations Density 0.055%
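
The density figure above is read here as the fraction of sampled token positions on which the feature fires. That reading is an assumption, since the dashboard does not state its definition. The function name, the threshold, and the sample data below are all hypothetical and are not taken from the dashboard's own code. A minimal sketch:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens on which the feature is
    active. The dashboard itself does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 11 active positions out of 20,000 tokens.
sample = [0.0] * 19_989 + [0.7] * 11
density = activation_density(sample)
print(f"{density:.3%}")
```

With these made-up numbers the feature fires on 11 of 20,000 positions, which is the same order of sparsity as the reported value.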