INDEX
    Explanations

    phrases indicating causation or consequences

    Followed by ", we" or ", the"

    Negative Logits
    s                 -1.55
    ים                -0.85
    ات                -0.69
    ς                 -0.60
    URLException      -0.57
    WriteTagHelper    -0.56
    pherals           -0.55
    stanbul           -0.55
    bidities          -0.55
    sted              -0.53
    Positive Logits
    o                 0.65
    er                0.63
    <bos>             0.63
    𝓵                 0.63
    𝓮                 0.62
                      0.60
    𝓲                 0.59
    𝓭                 0.57
    𝓾                 0.54
    𝓴                 0.53
    Act Density 2.589%
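
A density figure like the one above is commonly read as the fraction of sampled token positions on which the feature fires. The page does not state that definition, so the sketch below treats it as an assumption. The function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature is active.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical activations for 8 token positions (illustrative values only).
sample = [0.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.0, 0.4]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "25.000%"
```

Under this reading, 2.589% would mean the feature was active on roughly 26 of every 1,000 sampled positions.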

    No Known Activations