INDEX
    Explanations
    NEGATIVE LOGITS
    Token           Logit
    " targets"      -0.07
    "Uno"           -0.07
    " Brun"         -0.06
    "creen"         -0.06
    " extension"    -0.06
    " cuối"         -0.06
    " вс"           -0.06
    " ALIGN"        -0.06
    "trfs"          -0.06
    " фін"          -0.06
    POSITIVE LOGITS
    Token           Logit
    " Society"      0.21
    " Soci"         0.12
    " society"      0.12
    " Soc"          0.10
    "ociety"        0.10
    "HS"            0.08
    " soci"         0.08
    "erties"        0.08
    " societies"    0.08
    "ักท"           0.07
    Activation Density: 0.008%

    No Known Activations