INDEX
Explanations
phrases indicating a sequence of events
Negative Logits
stakes          -0.29
ility           -0.29
welf            -0.28
Winged          -0.28
contradiction   -0.28
encouragement   -0.28
fighter         -0.27
ciplinary       -0.27
Peninsula       -0.27
borgh           -0.27
Positive Logits
Ń·              0.44
abouts          0.41
ettings         0.39
EStream         0.39
atra            0.37
utm             0.36
orm             0.36
orthy           0.36
nih             0.36
daq             0.34
Activations Density 11.465%