INDEX
    Explanations

    informative phrases containing instructions or explanations

    phrases that indicate instructional content

    Negative Logits
    "enance"      -0.76
    "oppable"     -0.67
    "enegger"     -0.62
    "threat"      -0.62
    "volent"      -0.61
    "Politics"    -0.60
    "orical"      -0.60
    "Rum"         -0.59
    "AIN"         -0.59
    "orter"       -0.58

    Positive Logits
    " to"         0.92
    " toget"      0.79
    "--------------------------------------------------------"    0.78
    "semble"      0.78
    "to"          0.75
    " easy"       0.74
    " you"        0.73
    " much"       0.72
    "To"          0.71
    " TO"         0.70

    Act Density 0.069%

    No Known Activations