    Explanations

    references to causation and reasoning in statements

    Negative Logits
    Token            Logit
    "ird"            -0.15
    " ourselves"     -0.13
    "opl"            -0.13
    " yourself"      -0.13
    " Yourself"      -0.13
    "/her"           -0.13
    "ITAL"           -0.12
    " yourselves"    -0.12
    "igu"            -0.12
    "372"            -0.12
    Positive Logits
    Token            Logit
    "它"             0.48
    " its"           0.44
    " it"            0.44
    "它们"           0.42
    " оно"           0.41
    "，它"           0.40
    " they"          0.34
    " Its"           0.34
    " nó"            0.33
    "Its"            0.33
    Act Density 0.477%

    No Known Activations