INDEX
    Explanations
    Negative Logits
        Token             Value
        ","               0.99
        "ння"             0.97
        (token not shown) 0.96
        "ρους"            0.82
        "percayaan"       0.81
        "को"              0.80
        (token not shown) 0.80
        "یم"              0.80
        "اتی"             0.79
        "ق"               0.78
    Positive Logits
        Token             Value
        "5"               1.34
        " for"            1.14
        "0"               1.12
        "1"               1.08
        "4"               1.01
        "7"               0.94
        "9"               0.93
        "2"               0.91
        "for"             0.90
        "人不"            0.87
    Activation Density: 0.007%

    No Known Activations