    Negative Logits
    Token          Logit
    " Scaling"     -0.08
    "Scaling"      -0.08
    "betal"        -0.08
    "_scal"        -0.07
    " scalable"    -0.07
                   -0.07
    " Paths"       -0.07
    "(Paths"       -0.07
    " elucid"      -0.07
    "(scale"       -0.07
    Positive Logits
    Token                Logit
    " consent"           0.10
    " hypnot"            0.10
    " consentimiento"    0.10
    " willingly"         0.10
    " unwilling"         0.09
    " consens"           0.09
    "երը"                0.09
    " fantasies"         0.09
    " predetermined"     0.09
    " freiwill"          0.09
    Activation Density: 0.017%

    No Known Activations