INDEX
    Explanations

    fine-tuning, self-awareness

    Negative Logits
    the         0.83
    єю          0.81
    der         0.80
    ний         0.79
    щий         0.78
    ted         0.77
    <<          0.75
    regated     0.75
    de          0.74
    tements     0.74
    Positive Logits
    𝗺           0.90
     jika       0.86
     वडील       0.84
    𝗗           0.84
     أبريل      0.84
     ڈ          0.82
    上半年      0.80
    িকপ্ট       0.80
     skriv      0.80
     Faça       0.79
    Activation Density 0.212%

    No Known Activations