INDEX
    Explanations

    phrases indicating causation or reasoning

    Negative Logits
    Token             Logit
    avad              -0.18
    ÏĦÎŃ              -0.16
    ettes             -0.15
    scopes            -0.15
    nox               -0.15
    .override         -0.15
    ancode            -0.15
    artial            -0.15
    .djangoproject    -0.14
    hausen            -0.14
    Positive Logits
    Token             Logit
    745               0.19
    965               0.17
    797               0.16
    _icons            0.15
    964               0.15
    zed               0.15
    ared              0.15
    ARED              0.15
    815               0.14
    verbatim          0.14
    Activation Density: 0.082%

    No Known Activations