Explanations
Tokens from the model's (assistant's) reply, especially self-referential phrases and phrases offering help or clarification, i.e. the assistant speaking.
Negative Logits
Token          Logit
iteratively    0.55
workable       0.52
🛠             0.51
применять      0.50
metodologia    0.50
深入           0.50
ሂደ             0.50
ড়ান্ত           0.49
CFRP           0.48
дета           0.48
Positive Logits
Token          Logit
😊             0.77
smiley         0.68
😊             0.66
☺️             0.66
:)             0.64
なさい         0.63
or             0.63
kawaii         0.63
赤ちゃん       0.62
어린이         0.61
Activations Density 0.029%
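
To make the density figure concrete: the page does not define it, but assuming it is the fraction of sampled tokens on which the feature has a nonzero activation, 0.029% would correspond to roughly 29 active tokens per 100,000. The sketch below computes such a fraction under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all. The source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 tokens, 3 of which activate the feature.
sample = [0.0] * 9997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

A feature this sparse fires on only a handful of tokens in any given sample, so a density estimate is only as reliable as the sample is large.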