Explanations
reasons or explanations
phrases that indicate reasons or explanations
Negative Logits
tle          -0.80
ILCS         -0.78
bill         -0.75
wordpress    -0.74
sic          -0.72
raid         -0.71
bats         -0.70
iw           -0.70
Pred         -0.70
jaws         -0.69
Positive Logits
mortals      0.79
preferring   0.69
fame         0.68
reason       0.67
executing    0.65
variance     0.64
reasons      0.63
stopping     0.63
canonical    0.63
invalid      0.62
Activations Density: 0.172%