INDEX
Explanations
words related to causality or influence
phrases that indicate causation or effects
Negative Logits
ban          -0.65
thia         -0.65
scrimmage    -0.62
nurs         -0.60
---------    -0.59
76561        -0.58
ASE          -0.57
br           -0.57
pump         -0.56
Witch        -0.56
Positive Logits
hift         1.16
sure         0.97
akable       0.81
paio         0.80
ensibly      0.77
ailable      0.75
enders       0.75
emort        0.74
ebin         0.74
rontal       0.73
Activations Density: 0.120%