INDEX
    Explanations

    negations or expressions of disagreement

    Negative Logits
    Token         Logit
    "rape"        -0.17
    "asia"        -0.14
    "writeln"     -0.14
    "reu"         -0.14
    "wire"        -0.14
    "bilt"        -0.14
    "aoke"        -0.14
    "finity"      -0.13
    "drivers"     -0.13
    "claimer"     -0.13
    Positive Logits
    Token            Logit
    " longer"        0.32
    " different"     0.31
    " doubt"         0.29
    "xious"          0.28
    "thin"           0.27
    " exception"     0.26
    " stranger"      0.25
    " laughing"      0.25
    " match"         0.25
    " ordinary"      0.23
    Activation Density: 0.018%

    No Known Activations