INDEX
Explanations
imagining or pretending
Detecting role‑play or instruction prompts that directly address the model (second‑person setup statements like "imagine/you are..." defining a role or task).
Negative Logits
Token             Logit
----------</      -0.08
_tac              -0.07
消费品            -0.07
boobs             -0.07
::.               -0.07
.navigateByUrl    -0.06
shorts            -0.06
太多了            -0.06
getText           -0.06
/Math             -0.06
Positive Logits
Token             Logit
Boss              0.08
LR                0.07
IM                0.07
hunt              0.07
낀                0.07
RR                0.07
activating        0.07
releasing         0.07
setting           0.06
整个              0.06
Activations Density 0.030%
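
As a rough illustration of how a density figure like the one above could be computed, the sketch below assumes that "Activations Density" means the fraction of token positions on which the feature fires (activation above zero). That definition, the function name activation_density, the threshold parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition: density = (number of active tokens) / (total tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 3 of 10,000 token positions.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this assumed definition, a density of 0.030% would correspond to the feature being active on about 3 in every 10,000 tokens.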