INDEX
Explanations
instances where actions or decisions are taken in place of other actions or decisions
Negative Logits
artisan     -0.56
Nap         -0.54
read        -0.54
Vers        -0.54
Palestin    -0.53
marine      -0.51
anded       -0.50
essen       -0.50
Guest       -0.49
STAT        -0.49
Positive Logits
quitting    0.56
rever       0.56
being       0.55
fixing      0.53
dwelling    0.53
wasting     0.53
clock       0.53
retiring    0.52
anger       0.52
anything    0.52
Activations Density: 13.812%