INDEX
    Explanations

    questions that express confusion or challenge the status quo

    Negative Logits
    Token                   Logit
    " successive"           -0.84
    " selective"            -0.71
    " sustained"            -0.67
    " environmental"        -0.66
    "etheless"              -0.66
    " continued"            -0.64
    " gradual"              -0.63
    " delays"               -0.62
    " surplus"              -0.62
    " fewer"                -0.62
    Positive Logits
    Token                   Logit
    "fuck"                  0.89
    "soType"                0.88
    "abouts"                0.79
    "ãĤ§"                   0.78
    "isSpecialOrderable"    0.78
    "fork"                  0.75
    " ;)"                   0.74
    "Fuck"                  0.73
    "Ñı"                    0.73
    "bang"                  0.72
    Act Density 0.195%

    No Known Activations