Explanations
be followed by adjective
tokens that occur in the assistant's long, explanatory/instructional response text (i.e., helpful, informative sentences).
Negative Logits
ל  0.33
séparation  0.32
你  0.31
plufieurs  0.31
ЕЛЬ  0.30
nyní  0.30
ת  0.30
aldb  0.30
اسے  0.30
ل  0.29
Positive Logits
(token not shown)  0.47
an  0.44
able  0.44
be  0.42
in  0.38
a  0.37
t  0.35
of  0.35
e  0.34
friend  0.34
Activations Density 0.375%