INDEX
    Explanations

    language that indicates evidence and support in arguments or claims

    Negative Logits
    Token       Logit
    "adle"      -0.17
    "uck"       -0.16
    "atin"      -0.15
    "ives"      -0.15
    " Claud"    -0.15
    " Sims"     -0.14
    " plat"     -0.14
    "erty"      -0.14
    "otty"      -0.14
    " Barth"    -0.14
    Positive Logits
    Token       Logit
    "ocache"    0.16
    "andest"    0.14
    "alu"       0.14
    "emmel"     0.14
    "olest"     0.14
    " Unc"      0.14
    "جÙĪ"       0.14
    "cola"      0.14
    "çĮ"        0.14
    "ç"         0.14
    Activation Density: 0.458%

    No Known Activations