INDEX
    Explanations

    phrases indicating statements or claims

    phrases that include claims or assertions about events or states of being

    Negative Logits
    "intosh"             -0.67
    "irez"               -0.67
    "emort"              -0.63
    "pite"               -0.63
    "patch"              -0.61
    "DOWN"               -0.61
    "leground"           -0.60
    "ortunate"           -0.60
    " PLUS"              -0.59
    " Reconstruction"    -0.58
    Positive Logits
    " behave"        0.79
    " embody"        0.76
    "esty"           0.72
    " manipulate"    0.71
    " perform"       0.71
    "asted"          0.70
    " satisfy"       0.68
    " adhere"        0.67
    "ads"            0.67
    " speak"         0.66
    Activation Density: 0.105%

    No Known Activations