INDEX
    NEGATIVE LOGITS
    PersonalInfo  0.58
    Overwrite     0.57
    Forgery       0.55
    Shortcut      0.55
    Demo          0.52
    ItemGroup     0.51
    Consent       0.50
    Formal        0.50
    Indicator     0.50
    Decoder       0.50
    POSITIVE LOGITS
     S   0.64
     s   0.51
     V   0.51
     c   0.51
     i   0.50
     v   0.48
     B   0.47
     az  0.46
     b   0.46
     r   0.45
    Act Density 0.004%

    No Known Activations