    Negative Logits
    " correlation"     -0.07
    " comprehension"   -0.07
    " تعریف"           -0.07
    "_choose"          -0.07
    " быть"            -0.07
    " subtraction"     -0.07
    "_last"            -0.06
    " Mar"             -0.06
    " love"            -0.06
    " PARAMETERS"      -0.06
    Positive Logits
    " powerful"   0.11
    " мощ"        0.08
    "ault"        0.07
    " Pierre"     0.07
    " Powerful"   0.07
    " сила"       0.07
    " Power"      0.07
    "PFN"         0.07
    " power"      0.07
    "Power"       0.07
    Activation Density: 0.013%

    No Known Activations