INDEX
Explanations
Mentions of the assistant role or label, especially in chat headers or in text that refers to the assistant.
Negative Logits
acids        -0.07
epidemi      -0.07
ूक           -0.07
damned       -0.07
Their        -0.07
-Regular     -0.06
gravel       -0.06
mitigation   -0.06
_ll          -0.06
$res         -0.06
Positive Logits
_INLINE      0.07
ัพย          0.06
παίδ         0.06
roat         0.06
article      0.06
้ม           0.06
인터         0.06
_guard       0.05
_BASIC       0.05
:numel       0.05
Activations Density 0.028%
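
As a rough illustration of what a density figure like this usually means, the sketch below assumes the density is the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample activations are all assumptions for illustration and do not come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition: density = (# active positions) / (# positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Made-up data: 10,000 token positions, of which 3 are active.
acts = [0.0] * 10_000
for i in (12, 4_500, 9_001):
    acts[i] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.028% would correspond to the feature firing on roughly 28 of every 100,000 tokens.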