INDEX
Explanations
phrases related to recommending or encouraging specific actions or behaviors
repeated mentions of "the same" and concepts of doing the "right thing."
Negative Logits
quished      -0.73
gat          -0.72
osponsors    -0.70
ildo         -0.69
ONSORED      -0.68
ospons       -0.68
raltar       -0.66
urated       -0.66
Leilan       -0.65
opened       -0.63
Positive Logits
same          1.36
unthinkable   1.12
utmost        1.09
slightest     1.09
simplest      1.08
latter        1.00
hardest       1.00
same          0.98
groundwork    0.98
opposite      0.97
Activations Density: 0.085%