INDEX
    Explanations

    the presence of high-activation words that convey importance or significance in a context

    Negative Logits
    zzar            -0.55
    bootstrapcdn    -0.51
     never          -0.46
     Sharma         -0.46
     Waray          -0.46
    popd            -0.46
    lectricité      -0.45
    urllib          -0.45
    neros           -0.44
    ָׁ               -0.44
    Positive Logits
    tvguidetime     1.20
     تضيفلها        0.87
    ſelves          0.85
    Datuak          0.85
     Efq            0.80
     Majefty        0.79
     againſt        0.78
     myſelf         0.77
     houſe          0.75
     itſelf         0.73
    Activation Density 0.035%

    No Known Activations