    Negative Logits
        " Activ"        -0.87
        " PRI"          -0.82
                        -0.80
        " puree"        -0.78
        "ecturer"       -0.78
        " Machiavelli"  -0.78
        " coiff"        -0.75
        " delicacy"     -0.73
        " vechi"        -0.72
        "年以上"        -0.72
    Positive Logits
        " s"            1.20
        "stateParams"   0.78
        "arikat"        0.77
        " understated"  0.76
        " creen"        0.72
        " delas"        0.69
        " ucran"        0.69
        " Weir"         0.68
        "Citations"     0.68
        "stered"        0.68
    Activation Density: 0.022%

    No Known Activations