Explanations
model response introduction
Markers indicating the start of the model's (assistant's) response, or of other AI-generated content, within a dialogue structure.
Negative Logits
reuses        0.37
Proportion    0.32
್ರ            0.30
ietta         0.30
िल्ली          0.30
lç            0.30
isements      0.29
ácil          0.29
ighi          0.29
itia          0.29
Positive Logits
<h1>          0.44
आपने          0.42
fascinating   0.41
Okay          0.39
##            0.39
Você          0.38
термин        0.38
#             0.38
Sounds        0.37
你想           0.37
Activations Density: 0.074%