Explanations
phrases that highlight significant initial actions or observations
Negative Logits
ancies       -0.91
doms         -0.79
sports       -0.76
sung         -0.74
contin       -0.73
raph         -0.73
etheless     -0.72
Journals     -0.71
rw           -0.68
TPP          -0.67
Positive Logits
reaction      0.84
responders    0.81
introdu       0.80
foremost      0.77
sentence      0.71
blush         0.70
temptation    0.70
checkout      0.69
hurdle        0.67
knocks        0.66
Activations Density: 0.052%