INDEX
Explanations
phrases requesting actions or conveying politeness
requests for action or feedback
Negative Logits
arc            -0.77
ARC            -0.76
advertisement  -0.73
ription        -0.69
senal          -0.68
existence      -0.67
ula            -0.64
phrine         -0.63
mund           -0.61
bang           -0.60
Positive Logits
sir      0.92
Ignore   0.80
fill     0.78
pardon   0.77
ignore   0.76
inquire  0.74
excuse   0.74
forgive  0.73
beware   0.72
ignore   0.71
Activations Density 0.013%