INDEX
Explanations
statements about likely outcomes or predictions
phrases that indicate probability or likelihood of future events
Negative Logits
inth        -0.86
gado        -0.81
aredevil    -0.77
zeb         -0.77
ithing      -0.75
ilts        -0.74
artney      -0.74
gencies     -0.73
gian        -0.73
ortmund     -0.71
Positive Logits
underest         0.81
underestimate    0.75
infer            0.74
culprit          0.74
doomed           0.72
likely           0.70
underestimated   0.69
exagger          0.69
NULL             0.69
going            0.67
Activations Density: 0.036%