Explanations
phrases related to advocating for particular concepts or beliefs
the word "the"
Negative Logits
craft      -0.81
cies       -0.73
mares      -0.71
besides    -0.69
writes     -0.68
ells       -0.67
thood      -0.67
fn         -0.67
ersen      -0.67
each       -0.67
Positive Logits
easiest     1.26
strongest   1.25
same        1.22
hardest     1.17
simplest    1.17
largest     1.15
heaviest    1.13
biggest     1.13
smallest    1.13
greatest    1.13
Activation Density: 0.251%