INDEX
Explanations
textual features associated with instructions or procedural information
Negative Logits
abled      -0.67
rams       -0.66
paralle    -0.66
nz         -0.63
sports     -0.62
sung       -0.62
ral        -0.62
bm         -0.61
far        -0.60
Share      -0.60
Positive Logits
introdu          1.02
basics           0.85
Introduction     0.79
obligatory       0.75
impressions      0.74
rontal           0.73
initialize       0.72
obvious          0.71
hello            0.71
congratulations  0.69
Activations Density: 1.034%