INDEX
Explanations
phrases indicating a type of critique or evaluation
Negative Logits
iez       -0.18
ral       -0.17
isman     -0.17
inery     -0.16
gis       -0.16
/basic    -0.15
rather    -0.15
basic     -0.15
basic     -0.15
za        -0.15
Positive Logits
necessarily     0.25
anymore         0.23
ecessarily      0.17
nor             0.17
Drill           0.16
usual           0.16
particularly    0.16
matter          0.15
thing           0.15
rocket          0.15
Activations Density 0.071%