INDEX
Explanations
references to political figures and their actions
Negative Logits
reportedly    -0.24
seem          -0.19
seems         -0.18
seeming       -0.18
seemed        -0.18
obviously     -0.18
says          -0.17
evidently     -0.17
apparently    -0.17
Seems         -0.17
Positive Logits
might         0.23
deserved      0.22
will          0.22
merits        0.22
shouldn       0.21
belongs       0.21
indeed        0.20
could         0.19
deserves      0.19
belonged      0.19
Activations Density: 0.571%