INDEX
Explanations
phrases prompting the user to take action
phrases instructing the reader to take action or access information
Negative Logits
pard          -0.67
defe          -0.60
withd         -0.57
resc          -0.55
taboo         -0.55
handedly      -0.54
disson        -0.54
palate        -0.54
experiment    -0.54
lik           -0.53
Positive Logits
rid                      1.21
TING                     1.14
cloneembedreportprint    0.94
away                     0.92
aways                    0.89
Started                  0.81
Tickets                  0.79
Rid                      0.78
notified                 0.77
acquainted               0.76
Activations Density: 0.041%