INDEX
Explanations
contrary statements about potential outcomes
Negative Logits
Token      Logit
iddle      -0.63
raft       -0.62
iling      -0.59
ding       -0.58
ature      -0.58
Enhance    -0.58
rift       -0.57
ishing     -0.55
anwhile    -0.54
guiName    -0.53
Positive Logits
Token      Logit
liked      1.12
been       1.06
gotten     1.02
benefited  0.99
been       0.98
preferred  0.96
fared      0.94
gladly     0.91
gone       0.85
avoided    0.85
Activations Density: 0.054%