    Negative Logits
    Token             Logit
    " location"       -0.07
    "-list"           -0.07
    " locations"      -0.06
    (token missing)   -0.06
    " unary"          -0.06
    " sunset"         -0.06
    " Mann"           -0.06
    " nim"            -0.06
    "worm"            -0.06
    " بول"            -0.06

    Positive Logits
    Token             Logit
    " ethical"        0.11
    " ethics"         0.11
    " Eth"            0.09
    "ethical"         0.09
    " Ethics"         0.09
    "esthetic"        0.08
    " unethical"      0.08
    "Eth"             0.08
    " фил"            0.08
    " ethic"          0.08

    Act Density 0.016%

    No Known Activations