    Negative Logits
    Token             Logit
    " safety"         -0.68
    " IBOutlet"       -0.55
    "izarra"          -0.53
    "UnitTesting"     -0.52
    "Lähde"           -0.51
    " Safety"         -0.50
    " SAFETY"         -0.50
    "ấn"              -0.48
    "safety"          -0.48
    "''');"           -0.48
    Positive Logits
    Token             Logit
    "addGap"          0.68
    " ویکی‌پدی"       0.61
    "قایناق‌لار"      0.59
    " Вікі"           0.58
    "displayquote"    0.57
    "amation"         0.56
    "fjspx"           0.56
    "ftagPool"        0.55
    "πάρχ"            0.55
    "gouv"            0.52
    Activation Density: 0.177%

    No Known Activations