    Explanations

    Tokens from the model/assistant's reply, especially self-referential or help/clarification phrases (the assistant speaking).

    Negative Logits
     iteratively  0.55
     workable  0.52
    🛠  0.51
     применять  0.50
     metodologia  0.50
    深入  0.50
     ሂደ  0.50
    ড়ান্ত  0.49
     CFRP  0.48
     дета  0.48
    Positive Logits
     😊  0.77
     smiley  0.68
    😊  0.66
    ☺️  0.66
     :)  0.64
     なさい  0.63
     or  0.63
     kawaii  0.63
    赤ちゃん  0.62
     어린이  0.61
    Act Density 0.029%

    No Known Activations