INDEX
    Explanations

    references to people and their actions or attributes

    Negative Logits
    Token            Logit
    "402"            -0.18
    " preference"    -0.16
    "789"            -0.15
    "stract"         -0.15
    "ouble"          -0.14
    "لف"             -0.14
    "326"            -0.14
    " Ben"           -0.14
    " encounter"     -0.14
    "cran"           -0.13
    Positive Logits
    Token            Logit
    "ادم"            0.15
    "eč"             0.15
    "бо"             0.15
    "تا"             0.15
    "hq"             0.14
    "anton"          0.14
    "_SA"            0.14
    "innen"          0.14
    "ater"           0.14
    "ause"           0.13
    Activation Density: 0.004%

    No Known Activations