INDEX
    Explanations

    explains refusal reasons

    Negative Logits
    Token            Value
    "filenames"      0.69
    "marks"          0.66
    "Except"         0.63
    "etc"            0.63
    " Impacts"       0.62
    "Example"        0.61
    " Characters"    0.61
    " Example"       0.60
    "example"        0.58
    "Examples"       0.58
    Positive Logits
    Token               Value
    " twofold"          1.62
    " threefold"        1.45
    "在于"              1.15
    " να"               1.10
    " undoubtedly"      1.10
    " underwhelming"    1.10
    " toujours"         1.08
    " largely"          0.97
    " always"           0.96
    " reportedly"       0.96
    Activation Density: 0.116%

    No Known Activations