Explanations
verbs related to explaining or categorizing
Negative Logits
Token      Logit
spot       -0.73
die        -0.70
ker        -0.69
guard      -0.66
onew       -0.65
corn       -0.64
Torrent    -0.64
bang       -0.62
win        -0.61
bott       -0.60
Positive Logits
Token      Logit
ively      1.12
atively    0.93
urally     0.91
aspects    0.90
enance     0.89
ibly       0.88
ments      0.87
uate       0.81
ably       0.80
portions   0.77
Activations Density 5.927%