INDEX
    Explanations

    phrases indicating causation or consequences

    Negative Logits
        " Leadership"   -0.21
        " leadership"   -0.20
        "imos"          -0.16
        " Leaders"      -0.16
        "acha"          -0.16
        "enga"          -0.15
        "sein"          -0.15
        " leaders"      -0.14
        "iens"          -0.14
        "imedia"        -0.14
    Positive Logits
        " nowhere"      0.27
        " directly"     0.26
        "gers"          0.25
        " us"           0.24
        " astr"         0.22
        " them"         0.21
        " ultimately"   0.20
        " toward"       0.20
        " towards"      0.20
        " straight"     0.20
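    This page does not say how the logit lists are computed. A common convention for sparse autoencoder dashboards is to score every vocabulary token by the dot product of the feature's decoder direction with that token's unembedding vector, then report the lowest and highest scores. The sketch below assumes that convention; the function name `top_logit_effects`, the dict-of-columns layout, and the toy numbers are hypothetical and not taken from this dashboard.

```python
from typing import Dict, List, Tuple


def top_logit_effects(
    decoder_direction: List[float],
    unembedding: Dict[str, List[float]],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, score) pairs.

    Assumption: a token's score is the dot product of the feature's decoder
    direction with that token's unembedding column.
    """
    scores = {
        token: sum(d * u for d, u in zip(decoder_direction, column))
        for token, column in unembedding.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1])  # ascending
    negative = ranked[:k]            # most negative first
    positive = ranked[::-1][:k]      # most positive first
    return negative, positive


if __name__ == "__main__":
    # Toy 2-dimensional vectors (hypothetical); a leading space marks a
    # word-initial token, as in the lists above.
    toy_unembedding = {
        " nowhere": [1.0, 0.0],
        " directly": [0.8, 0.1],
        " Leadership": [-1.0, 0.0],
        "imos": [-0.5, 0.2],
    }
    negative, positive = top_logit_effects([0.3, 0.1], toy_unembedding, k=2)
    print([t for t, _ in negative])  # [' Leadership', 'imos']
    print([t for t, _ in positive])  # [' nowhere', ' directly']
```

    Quoting the tokens keeps the leading space visible, which matters because " leadership" and "leadership" are different vocabulary entries.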
    Activation Density: 0.020%
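    If activation density is read as the percentage of tokens on which the feature's activation is above zero (an assumption; the page does not define it), 0.020% means the feature fires on roughly 2 of every 10,000 tokens. A minimal sketch of that calculation follows; the function name `activation_density` and the zero threshold are illustrative choices.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" counts a token as active when its activation is
    strictly greater than the threshold (zero by default).
    """
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    if total == 0:
        raise ValueError("need at least one activation")
    return 100.0 * active / total


if __name__ == "__main__":
    # Hypothetical data: 10,000 tokens, of which exactly 2 activate the feature.
    acts = [0.0] * 9998 + [1.5, 0.7]
    print(f"{activation_density(acts):.3f}%")  # 0.020%
```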

    No Known Activations