    NEGATIVE LOGITS
    duh          0.47
    Paw          0.43
     arbeitet    0.43
    grund        0.42
    Flor         0.42
    rawler       0.41
    deen         0.41
    car          0.41
    Ri           0.41
    ür           0.40
    POSITIVE LOGITS
     difference     0.51
     differences    0.50
     διαφο          0.49
     disclosure     0.48
     additional     0.46
     Differ         0.46
     orig           0.45
     difer          0.45
     डिफर           0.44
    额外的          0.44
    Activation Density: 0.001%

    No Known Activations