INDEX
    Explanations

    phrases indicating assumptions or beliefs

    phrases emphasizing assumptions or beliefs about societal issues

    Negative Logits
    guard     -0.78
    ensed     -0.71
    WER       -0.71
    yna       -0.70
    backer    -0.70
    arthed    -0.69
    inar      -0.69
    eng       -0.69
    hm        -0.68
    AZ        -0.66
    Positive Logits
     somehow        0.90
     someday        0.82
     everyone       0.81
     everything     0.79
     they           0.76
     rationality    0.75
     anyone         0.74
     these          0.72
     abandoning     0.71
     justifies      0.69
    Activation Density 0.192%

    No Known Activations