INDEX
    Explanations
    Negative Logits
        Token               Logit
        " a"                -0.08
        " litt"             -0.07
        " Muj"              -0.07
        " hưởng"            -0.07
        "olar"              -0.07
        " misunderstood"    -0.07
        "ن"                 -0.07
        " of"               -0.07
        "וף"                -0.07
        " thị"              -0.07
    Positive Logits
        Token               Logit
        (not shown)         0.07
        " allow"            0.07
        (not shown)         0.07
        " wannonce"         0.07
        " dürfen"           0.07
        " estimator"        0.07
        " bietet"           0.06
        "EditingStyle"      0.06
        " pricey"           0.06
        "iface"             0.06
    Activation Density: 0.031%

    No Known Activations