INDEX
Explanations
phrases inviting or encouraging engagement with content
Negative Logits
ngth        -0.74
ragon       -0.70
ected       -0.70
bably       -0.68
urdue       -0.67
ãĥł         -0.65
posed       -0.64
aturated    -0.64
IDS         -0.64
fixed       -0.63
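The `ãĥł` entry in the negative list looks like encoding debris, but it is more likely a token shown in byte-level form. The sketch below assumes the underlying model uses a GPT-2-style byte-level BPE vocabulary, which this page does not state; under that assumption the three characters map back to the UTF-8 bytes E3 83 A0. The helper name `decode_byte_level_token` is hypothetical and introduced only for illustration.

```python
def bytes_to_unicode():
    """Build the byte -> printable-character table used by GPT-2-style byte-level BPE."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is assigned a code point starting at 256.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def decode_byte_level_token(token: str) -> str:
    """Hypothetical helper: turn a byte-level token string back into readable text."""
    char_to_byte = {ch: b for b, ch in bytes_to_unicode().items()}
    raw = bytes(char_to_byte[ch] for ch in token)
    # A single token may hold only part of a multi-byte character,
    # so undecodable bytes are replaced rather than raising.
    return raw.decode("utf-8", errors="replace")


print(decode_byte_level_token("ãĥł"))  # prints ム (U+30E0, katakana "mu")
```

If the model uses a different tokenizer, the mapping does not apply and the entry should be read as-is.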
Positive Logits
More        0.82
chu         0.82
MORE        0.76
Article     0.74
ers         0.70
ership      0.67
About       0.67
ABOUT       0.66
Less        0.65
about       0.65
Activations Density 0.120%