Explanations
phrases related to reasoning or explanation
statements or phrases that indicate causation or reasons
Negative Logits
Token         Logit
stad          -0.72
ievers        -0.68
mson          -0.67
dar           -0.66
aux           -0.66
iard          -0.64
ature         -0.64
urch          -0.64
aven          -0.63
ults          -0.63
Positive Logits
Token         Logit
undoubtedly    1.03
doubtless      0.95
obvious        0.94
sheer          0.93
attributable   0.83
undeniable     0.82
simplicity     0.81
probably       0.80
evident        0.79
unavoidable    0.77
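The two logit lists rank output tokens by how strongly this feature raises or lowers their logits. Below is a minimal sketch of how such lists can be produced. It assumes, without confirmation from this page, that each value is the dot product of the feature's decoder direction with that token's unembedding vector. The function names `logit_effects` and `top_k` and the toy vectors are hypothetical and are not real model weights.

```python
from typing import Dict, List, Tuple


def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Assumed definition: dot product of the feature's decoder direction
    with each token's unembedding vector (hypothetical helper)."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}


def top_k(effects: Dict[str, float],
          k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, logit) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by logit
    negative = ranked[:k]
    positive = ranked[::-1][:k]  # descending by logit
    return negative, positive


# Hypothetical toy data: a 3-dimensional residual stream and 4 tokens.
decoder_dir = [1.0, 0.5, -0.5]
unembed = {
    "undoubtedly": [0.8, 0.4, -0.2],
    "obvious": [0.6, 0.2, 0.0],
    "stad": [-0.5, -0.2, 0.3],
    "aux": [-0.3, 0.0, 0.4],
}

effects = logit_effects(decoder_dir, unembed)
negative, positive = top_k(effects, k=2)
print("negative:", [tok for tok, _ in negative])  # negative: ['stad', 'aux']
print("positive:", [tok for tok, _ in positive])  # positive: ['undoubtedly', 'obvious']
```

Real dashboards may first fold normalization weights into the unembedding. In that case the numbers would differ from a plain dot product, although the ranking idea stays the same.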
Activation density: 0.103%
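Activation density is usually read as the share of tokens on which the feature fires. A value of 0.103% therefore means roughly one token in about a thousand. The sketch below assumes that "fires" means an activation strictly above zero. The function name and the sample data are hypothetical.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`
    (assumed definition; the zero threshold is an assumption)."""
    acts = list(activations)
    if not acts:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in acts if a > threshold)
    return 100.0 * firing / len(acts)


# Hypothetical sample: the feature fires on 1 of 1,000 tokens.
sample = [0.0] * 999 + [3.2]
print(f"{activation_density(sample):.3f}%")  # prints 0.100%
```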