INDEX
Explanations
references to decisions or actions being made
phrases that involve various forms of the word "move" indicating actions or changes
Negative Logits
omial      -0.74
sqor       -0.72
english    -0.68
etheless   -0.66
Koran      -0.65
ordon      -0.62
aples      -0.62
sung       -0.61
errors     -0.61
iciency    -0.59
Positive Logits
toward     0.86
towards    0.85
able       0.83
ments      0.81
backs      0.80
rers       0.79
over       0.77
wright     0.75
ler        0.74
llan       0.73
Activations Density: 0.035%