INDEX
    Explanations
    NEGATIVE LOGITS
    " ours"        -0.10
    "anan"         -0.10
    " friendly"    -0.09
    " Erotic"      -0.09
    " Prostit"     -0.09
    " Kiss"        -0.08
    " kiss"        -0.08
    " ｰ"           -0.08
    " kissed"      -0.08
    "illo"         -0.08
    POSITIVE LOGITS
    " command"        0.13
    " orders"         0.12
    "命令"            0.12
    " requests"       0.12
    " commands"       0.11
    "orders"          0.11
    "commands"        0.11
    "/command"        0.11
    " instruction"    0.11
    " request"        0.10
    Act Density 0.054%

    No Known Activations