Explanations
Text that refers to the model's identity or system role, such as system or instruction messages that declare the assistant or AI.
Negative Logits
Token          Logit
Miami          -0.07
IFICATION      -0.07
popul          -0.06
analy          -0.06
Snow           -0.06
()).           -0.06
nos            -0.06
793            -0.06
mood           -0.06
атель          -0.06
Positive Logits
Token          Logit
handwritten    0.06
coarse         0.06
FStar          0.06
pravděpodob    0.06
geliş          0.06
wait           0.06
_GO            0.06
στι            0.06
olabilir       0.06
beer           0.06
Activations Density: 0.005%