INDEX
    Explanations

    words related to arguments or debates

    linguistic forms that denote actions or characteristics

    Negative Logits
    Token          Logit
    "ilities"      -0.64
    " Seym"        -0.63
    "ulates"       -0.62
    "raints"       -0.61
    "ility"        -0.60
    "ij士"          -0.60
    "ulating"      -0.59
    "ADRA"         -0.59
    "SU"           -0.59
    " ordinary"    -0.58

    Positive Logits
    Token          Logit
    "oad"           1.08
    "ength"         1.00
    "oaded"         0.93
    "gling"         0.90
    "ibrary"        0.89
    "uci"           0.88
    "ogue"          0.86
    "phrine"        0.78
    "erie"          0.77
    "xual"          0.77
    Activation Density: 0.105%

    No Known Activations