INDEX
    Explanations

    negative aspects and progress

    Negative Logits
    0.89
    0.87
    0.86
    0.86
    0.82
    0.82
    雖然  0.81
    他說  0.80
    0.80
    0.79
    Positive Logits
    线  0.62
    0.60
    0.58
    0.57
    0.57
    0.56
    0.55
    0.55
    0.55
    0.55
    Act Density 0.021%

    No Known Activations