    Explanations

    interrogative phrases and questions relating to reasoning and analysis

    Follows "how," "why," or "where"

    how why where questions

    Negative Logits
    Token        Logit
    " he"        -0.64
    " they"      -0.62
    " it"        -0.52
    " we"        -0.47
    "hiran"      -0.45
    "juvant"     -0.43
    "the"        -0.43
    " the"       -0.43
    " он"        -0.42
    "they"       -0.42
    Positive Logits
    Token        Logit
    " does"      1.26
    " do"        1.21
    " did"       1.20
    " Does"      0.99
    " Did"       0.96
    " are"       0.88
    "Does"       0.88
    "Did"        0.86
    " is"        0.79
    " can"       0.77
    Activation Density: 0.161%

    No Known Activations