INDEX
Explanations
requests or statements followed by a specific action or instruction
instances of the word "Please," indicating a request or instruction
Negative Logits
ounter      -0.70
attributes  -0.64
early       -0.64
shifting    -0.64
secret      -0.63
diss        -0.63
marvel      -0.62
prototyp    -0.62
skirm       -0.62
latent      -0.61
Positive Logits
Please      3.25
Please      2.34
PLEASE      2.26
please      2.13
please      2.05
Sorry       1.43
Thank       1.41
PLE         1.25
Feel        1.24
Therefore   1.22
Activations Density: 0.023%