    Negative Logits
    Token                Logit
    "\tjava"             -0.07
    " extractor"         -0.07
    " diffs"             -0.07
    "感觉自己"            -0.06
    (blank in source)    -0.06
    " transformers"      -0.06
    (blank in source)    -0.06
    "_instructions"      -0.06
    " architectures"     -0.06
    "veloper"            -0.06
    Positive Logits
    Token                Logit
    " facing"            0.07
    " Nam"               0.07
    "rowning"            0.07
    "amation"            0.07
    "Sex"                0.07
    "edicine"            0.07
    (blank in source)    0.07
    " rap"               0.06
    "با"                 0.06
    "elen"               0.06
    Activation Density: 0.006%

    No Known Activations