INDEX
    Explanations

    instances of dialogue and expressions of inquiry or explanation

    Negative Logits
    Token         Logit
    "olis"        -0.15
    "nga"         -0.15
    "verbatim"    -0.13
    "ild"         -0.13
    "keh"         -0.13
    "нина"        -0.12
    "echo"        -0.12
    "ora"         -0.12
    "语"          -0.12
    "233"         -0.12
    Positive Logits
    Token              Logit
    " explain"         0.74
    " explanation"     0.73
    " explains"        0.68
    " explaining"      0.67
    " expl"            0.67
    " explanations"    0.67
    " explained"       0.66
    " Expl"            0.64
    " Explain"         0.62
    "explain"          0.62
    Activation Density: 0.263%

    No Known Activations