INDEX
Explanations
phrases requesting human verification
requests or prompts for user actions
Negative Logits
pires    -0.79
laus     -0.72
rist     -0.67
pher     -0.67
itive    -0.67
borgh    -0.65
visor    -0.65
kept     -0.63
amus     -0.62
bent     -0.62
Positive Logits
verify       0.91
Subscribe    0.81
(not shown)  0.79
enter        0.76
enable       0.76
Ignore       0.76
login        0.76
contact      0.74
disregard    0.73
inquire      0.73
Activations Density 0.011%