Explanations
specific patterns or phrases associated with alternative or contrasting actions
phrases indicating assumptions or alternatives
Negative Logits
stad        -0.73
compr       -0.70
stead       -0.69
berra       -0.66
SG          -0.64
DL          -0.64
IQ          -0.63
culosis     -0.62
hard        -0.61
DA          -0.60
Positive Logits
Downloadha  0.73
SPONSORED   0.72
mere        0.71
amiya       0.69
ours        0.68
Instead     0.68
relying     0.68
recourse    0.67
isites      0.67
Ľ           0.67
Activations Density: 0.084%