Explanations
expressions related to leaving and resulting actions or consequences
Negative Logits
ecided      -0.14
ahn         -0.14
.ix         -0.14
ège         -0.14
ottage      -0.14
plusplus    -0.14
jed         -0.13
drv         -0.13
svp         -0.13
oice        -0.13
Positive Logits
leaving     0.90
leave       0.84
Leave       0.77
leaves      0.77
Leaving     0.75
Leave       0.74
leave       0.72
Leaves      0.66
_leave      0.56
çķĻ         0.55
Activations Density: 0.185%