INDEX
Explanations
personal pronouns and affirmations in conversational contexts
I/you followed by verbs
tokens that mark the assistant/model's reply or role label (e.g., "Assistant", "Response", the colon after a role, or other assistant-turn markers).
Negative Logits
useAppContext  -0.42
HasFactory  -0.40
ljiv  -0.40
(no token text)  -0.40
脚注の使い方  -0.40
tangentMode  -0.39
الإنجليزية  -0.38
tavo  -0.38
vuestro  -0.38
eterangan  -0.38
Positive Logits
unsafe  0.46
Safe  0.44
Saf  0.43
المعيارى  0.43
Safety  0.42
safer  0.41
SAFE  0.41
aarrggbb  0.41
Safer  0.40
cup  0.40
Activation Density: 0.009%
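
As a rough illustration of what a density figure like the one above could mean, the sketch below computes the fraction of token positions on which a feature is active. This assumes density is defined as "share of positions with activation above a threshold"; the source does not state the definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature fires. The dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 9 of 100,000 token positions.
sample = [0.0] * 99_991 + [1.5] * 9
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.009%"
```

Under this reading, a density of 0.009% corresponds to roughly 9 active positions per 100,000 tokens, which would make this a very sparse feature.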