INDEX
Explanations
phrases related to urging or encouraging actions
instances of the word "urging."
Negative Logits
Surv         -0.82
Flan         -0.73
Kinnikuman   -0.72
çĦ           -0.69
sung         -0.66
imb          -0.63
Highlander   -0.63
Sieg         -0.63
ppo          -0.61
ju           -0.60
Positive Logits
urging       0.95
irection     0.86
itudinal     0.77
tip          0.76
usher        0.75
incent       0.75
=]           0.71
redients     0.69
caution      0.69
OPLE         0.69
Activations Density 0.018%