Explanations
requests for action or assistance
instances of the word "please."
Negative Logits
Token                 Logit
ARC                   -0.79
arc                   -0.72
advertisement         -0.69
senal                 -0.66
existence             -0.65
ription               -0.64
ãĤµãĥ¼ãĥĨãĤ£ãĥ¯ãĥ³    -0.64
perty                 -0.62
alone                 -0.62
é¾                    -0.61
Positive Logits
Token                 Logit
sir                    0.90
lyak                   0.78
pardon                 0.77
please                 0.77
note                   0.75
kindly                 0.74
excuse                 0.74
fill                   0.73
Ignore                 0.73
Please                 0.73
Activations Density: 0.013%