Explanations
phrases indicating a negative outcome or consequence
the word "the."
Negative Logits
çīĪ           -0.79
fixme         -0.74
å§«           -0.73
=#            -0.69
respectfully  -0.66
assumes       -0.66
iji           -0.66
accordingly   -0.65
periodically  -0.65
isin          -0.65
Positive Logits
slightest     1.35
usual         1.04
brightest     1.02
same          0.96
easiest       0.95
smartest      0.93
safest        0.87
fastest       0.84
anymore       0.79
nor           0.78
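The logit lists rank vocabulary tokens by how strongly this feature pushes their output logits down (negative) or up (positive). Below is a minimal sketch of how such lists are commonly produced, assuming the usual SAE-dashboard convention of multiplying the feature's decoder direction by the model's unembedding matrix and ignoring the final layer norm. This dashboard's exact procedure may differ. The function name `logit_effects` and the toy vectors are hypothetical and illustrative only; pure Python keeps it self-contained.

```python
from typing import List, Sequence, Tuple


def logit_effects(
    decoder_dir: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Score every token by decoder_dir @ unembed and return the k extremes.

    decoder_dir: the feature's decoder direction, length d_model (assumed input).
    unembed: d_model x vocab_size unembedding matrix, given as a list of rows.
    Returns (most_negative, most_positive) as lists of (token, score).
    """
    d_model = len(decoder_dir)
    # Dot product of the decoder direction with each column of the unembedding.
    scores = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, scores), key=lambda ts: ts[1])
    most_negative = ranked[:k]        # ascending: most negative first
    most_positive = ranked[::-1][:k]  # descending: most positive first
    return most_negative, most_positive


if __name__ == "__main__":
    # Toy 2-dimensional model with a 3-token vocabulary (hypothetical numbers).
    neg, pos = logit_effects(
        [1.0, -0.5],
        [[0.2, 1.0, -0.4], [0.6, -0.8, 0.1]],
        ["same", "slightest", "fixme"],
        k=1,
    )
    print("negative:", neg)
    print("positive:", pos)
```

In a real model the vocabulary has tens of thousands of entries, so only the top and bottom k (ten each above) are displayed.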
Activation Density: 0.073%
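Activation density is the share of sampled tokens on which the feature fires. A density of 0.073% is 0.00073 as a fraction, or roughly 73 firing tokens per 100,000. The sketch below assumes "fires" means an activation strictly above a threshold of 0. The helper name `activation_density` and the threshold default are hypothetical, not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the percentage of tokens whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # 73 firing tokens out of 100,000 sampled tokens (synthetic data).
    acts = [1.0] * 73 + [0.0] * 99_927
    print(f"{activation_density(acts):.3f}%")
```

A very low density like this one is typical of a narrow, rarely active feature.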