    Explanations

    mentions of criticism regarding media or artistic works

    Negative Logits
      " --"          -0.22
      " --↵"         -0.22
      " our"         -0.21
      "ï"            -0.20
      "à¥ľ"          -0.19
      " ours"        -0.19
      (not shown)    -0.18
      "--↵"          -0.18
      " ---"         -0.18
      "our"          -0.18
    Positive Logits
      ",["            0.46
      "[c"            0.45
      ".["            0.42
      ":["            0.39
      "["             0.36
      " ^{["          0.35
      (not shown)     0.35
      "[["            0.35
      ").["           0.35
      "{{"            0.32
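Logit lists like the two above are commonly produced on SAE dashboards by projecting the feature's decoder direction through the model's unembedding matrix and ranking vocabulary tokens by the result. The sketch below assumes that method and ignores the final layer norm and any bias terms. The function `top_logits`, the toy matrices, and the placeholder tokens are hypothetical and do not reproduce the values listed for this feature.

```python
from typing import List, Sequence, Tuple


def top_logits(
    decoder_dir: Sequence[float],
    unembed: Sequence[Sequence[float]],  # shape [d_model][vocab_size]
    vocab: Sequence[str],
    k: int = 3,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (top-k positive, top-k negative) tokens by direct logit effect.

    Hypothetical helper: ignores the final layer norm and any bias terms.
    """
    scores = []
    for j, tok in enumerate(vocab):
        # Dot product of the feature direction with token j's unembedding column.
        s = sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        scores.append((tok, s))
    ranked = sorted(scores, key=lambda pair: pair[1])  # most negative first
    positive = ranked[::-1][:k]  # largest scores first
    negative = ranked[:k]        # smallest scores first
    return positive, negative


# Toy example: d_model = 2, vocabulary of four placeholder tokens.
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
unembed = [
    [0.4, -0.2, 0.1, 0.0],
    [0.2, 0.0, -0.4, 0.6],
]
decoder_dir = [1.0, 0.5]
pos, neg = top_logits(decoder_dir, unembed, vocab, k=2)
print([t for t, _ in pos])  # ['tok_a', 'tok_d']
print([t for t, _ in neg])  # ['tok_b', 'tok_c']
```

Because only a dot product with each unembedding column is involved, this "direct logit" view says which next tokens the feature pushes up or down when added to the residual stream, not which inputs make it fire.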
    Activation Density 1.372%
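Activation density is the share of tokens on which the feature fires, so 1.372% corresponds to roughly one token in 73. The sketch below assumes "fires" means an activation strictly above zero (tools may use a different threshold). The function `activation_density` and the sample activations are hypothetical.

```python
from typing import Sequence


def activation_density(
    activations: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`.

    Hypothetical helper; assumes one activation value per token.
    """
    total = 0
    active = 0
    for sequence in activations:
        for value in sequence:
            total += 1
            if value > threshold:
                active += 1
    # Guard against an empty dataset to avoid division by zero.
    return 100.0 * active / total if total else 0.0


# Two toy sequences, 7 tokens in total, 2 of which activate the feature.
acts = [[0.0, 0.3, 0.0], [0.0, 0.0, 0.0, 1.2]]
print(round(activation_density(acts), 3))  # 28.571
```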

    No Known Activations