INDEX
    Explanations

    genuine or true + positive concept

    NEGATIVE LOGITS
    "𝙥"  1.36
    "𝙤"  1.36
    " twos"  1.33
    " lesion"  1.30
    "𝒐"  1.30
    "িণ"  1.29
    " refrain"  1.27
    "𝑛"  1.26
    " vien"  1.26
    "нци"  1.25
    POSITIVE LOGITS
    "ところ"  1.65
    " estate"  1.61
    "paar"  1.58
    "politik"  1.58
    "অর্"  1.53
    "পক্ষ"  1.52
    "ignment"  1.49
    "ligen"  1.43
    (blank)  1.41
    "यल"  1.41
    Act Density 0.165%

    No Known Activations