INDEX
Explanations
phrases emphasizing a specific belief or understanding
Negative Logits
backer    -0.76
hens      -0.73
arthed    -0.73
adle      -0.72
aukee     -0.70
guard     -0.68
swick     -0.67
ensed     -0.66
eng       -0.66
ante      -0.66
Positive Logits
somehow       0.85
they          0.80
someday       0.76
THEY          0.73
unless        0.73
justifies     0.71
anyone        0.71
everything    0.70
everyone      0.70
there         0.69
Activations Density: 0.170%