Explanations
model responses
the beginning of the assistant’s reply in a dialogue (the assistant turn marker or first token of the model’s message).
Negative Logits
regressions  0.47
opérations  0.46
വിൽപ്പന  0.45
variété  0.45
auteurs  0.45
radiographs  0.45
ÜR  0.44
鸮  0.44
ফ্যাস  0.43
collaborateurs  0.42
Positive Logits
sorry  0.69
sorry  0.69
plz  0.54
Sorry  0.49
Sorry  0.49
hehe  0.49
pls  0.47
they  0.47
answer  0.46
you  0.46
Activations Density: 0.007%