    Explanations

    words related to reasoning and justification

    Negative Logits
    Token       Logit
    ammy        -0.78
    kick        -0.74
    yang        -0.71
    jri         -0.67
    chu         -0.67
    JO          -0.67
    rael        -0.65
    hops        -0.64
    hire        -0.64
    along       -0.64
    Positive Logits
    Token       Logit
    izations    1.36
    isations    1.26
    ization     1.22
    isation     1.16
    izing       1.15
    izers       1.12
    istic       1.11
    izer        1.10
    isers       1.06
    izes        1.05
    Activation Density: 0.022%

    No Known Activations