INDEX
Explanations
words related to actions with significant impact or consequences
actions related to manipulation or influence
Negative Logits
Token        Logit
ãĤ´ãĥ³       -0.80
ffe          -0.70
?)           -0.70
tesy         -0.67
option       -0.67
wink         -0.66
unny         -0.63
pse          -0.62
âĺ           -0.62
Unsure       -0.61
Positive Logits
Token          Logit
unsuspecting   0.93
uate           0.84
unwanted       0.77
enance         0.74
various        0.72
incoming       0.71
unwitting      0.70
passers        0.68
certain        0.68
alleged        0.67
Activations Density: 0.390%