    Explanations

    AI assistant explicit content refusal

    Negative Logits
    “[      0.52
    /[      0.51
     [      0.49
    :][     0.48
    ]=[     0.48
    [][]    0.47
    ,[      0.47
    :]      0.46
    ][      0.46
    :[      0.46
    Positive Logits
     "(     0.56
    "(      0.55
    “(      0.53
    。(      0.49
     ...(   0.49
     “(     0.49
    ıkl     0.47
    ).(     0.47
    '(      0.46
    )(      0.44
    Act Density 0.006%

    No Known Activations