INDEX
    Explanations

    references to the significance of various concepts or issues

    Negative Logits
    Token              Logit
    "InjectAttribute"  -0.76
    "Портал"           -0.68
    " Ginger"          -0.52
    "Cham"             -0.51
    " przew"           -0.51
    "بوابة"            -0.50
    "Ainsi"            -0.50
    " Cuth"            -0.49
    "mink"             -0.49
    "piram"            -0.49

    Positive Logits
    Token              Logit
    "={()"              0.92
    " importance"       0.72
    " ​​"               0.71
    "={()=>"            0.70
    "praš"              0.66
    "importance"        0.65
    "őbb"               0.63
    "Importance"        0.63
    " pic"              0.62
    " mahar"            0.61
    Activation Density: 0.085%

    No Known Activations