INDEX
Explanations
phrases that argue or advocate for a specific point of view or position
Negative Logits
Token      Logit
Seym       -0.70
orks       -0.66
liction    -0.65
elta       -0.64
attery     -0.63
ummer      -0.62
BW         -0.62
leted      -0.61
kered      -0.60
onder      -0.60
Positive Logits
Token      Logit
against    1.06
convinc    0.93
against    0.88
Against    0.86
cases      0.77
Keen       0.75
Against    0.75
loudly     0.74
persu      0.73
why        0.73
Activations Density: 0.021%