INDEX
Explanations
phrases indicating emphasis or persuasion
expressions of belief or trust
Negative Logits
imilar    -0.74
sidel     -0.71
idian     -0.70
advant    -0.64
wich      -0.64
ouk       -0.61
ipment    -0.60
erness    -0.58
erto      -0.58
iri       -0.57
Positive Logits
Yourself  0.65
hype      0.62
admit     0.62
zers      0.61
me        0.61
expr      0.60
WHEN      0.60
!:        0.59
deceive   0.59
Twice     0.58
Activations Density: 0.102%