INDEX
    Explanations

    conjunctions and phrases indicating connections between concepts

    Negative Logits
    Token           Logit
    "himself"       -0.91
    "herself"       -0.88
    " herself"      -0.80
    " Himself"      -0.80
    " himself"      -0.79
    "him"           -0.73
    "themselves"    -0.73
    " lui"          -0.72
    "them"          -0.71
    " and"          -0.71
    Positive Logits
    Token           Logit
    " there"        1.52
    " it"           1.49
    " although"     1.25
    " they"         1.15
    " the"          1.14
    " its"          1.12
    " while"        1.10
    " this"         1.05
    " when"         0.95
    " these"        0.92
    Activation Density: 0.683%

    No Known Activations