INDEX
    Explanations

    math problems

    Negative Logits
                  -0.08
                  -0.07
    018           -0.07
                  -0.07
                  -0.07
     residues     -0.07
    .jpa          -0.07
    imei          -0.07
     longing      -0.07
    306           -0.07
    Positive Logits
    妹妹          0.09
     nephew       0.09
     Benchmark    0.08
     Retriever    0.08
     chores       0.08
     donut        0.08
    benchmark     0.08
     Cascade      0.08
    Lisa          0.08
    girl          0.08
    Act Density 0.135%

    No Known Activations