    Explanations

    punctuation marks, particularly quotation marks and apostrophes

    Negative Logits
    '));        -1.08
    }');        -0.97
    ]');        -0.96
    )');        -0.93
    '){         -0.91
    %");        -0.89
    _           -0.86
    ...');      -0.85
    '):         -0.85
    /');        -0.82
    Positive Logits
                1.60
     “          1.27
    ("          1.22
                1.16
    ,“          1.15
    (“          1.12
    .“          1.10
    "           1.10
     "          1.09
    ="          1.09
    Act Density 0.545%
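
As a rough illustration of how an activation density figure like the one above might be read, the sketch below assumes it is the fraction of token positions on which the feature's activation exceeds zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumed definition: density = (active positions) / (all positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: one feature over 1,000 token positions,
# firing on 5 of them and silent (0.0) on the rest.
sample = [0.0] * 995 + [0.8, 1.2, 0.3, 2.1, 0.5]

density = activation_density(sample)
print(f"Act Density {density:.3%}")  # prints "Act Density 0.500%"
```

Under this assumed definition, a value of 0.545% would mean the feature fires on roughly 5 or 6 of every 1,000 token positions.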

    No Known Activations