    Negative Logits
    Token          Logit
    "PBS"          -0.08
    "ovar"         -0.08
    "Loan"         -0.08
    " housed"      -0.08
    " remodel"     -0.07
    "aturing"      -0.07
    "Histogram"    -0.07
    "Rental"       -0.07
    "Campus"       -0.07
    "Fal"          -0.07
    Positive Logits
    Token          Logit
    " GPT"         0.09
    " sinful"      0.08
    " lli"         0.08
    "<|end|>"      0.08
    " vivid"       0.08
    " unethical"   0.08
    "?↵↵↵↵"        0.08
    " wrongdoing"  0.08
    "GPT"          0.08
    " THC"         0.08
    Activation Density: 0.047%

    No Known Activations