INDEX
    Explanations

    contradictions and nuances in arguments

    Negative Logits
    " not"       -0.21
    "不"         -0.20
    " NOT"       -0.20
    " nicht"     -0.19
    " không"     -0.19
    "not"        -0.18
    " niet"      -0.17
    " не"        -0.17
    "icz"        -0.17
    "unch"       -0.16
    Positive Logits
    " rather"    0.44
    " Rather"    0.41
    "Rather"     0.39
    "rather"     0.38
    " instead"   0.35
    " Instead"   0.33
    "Instead"    0.33
    " naopak"    0.32
    " sondern"   0.32
    "instead"    0.28
    Act Density 0.290%

    No Known Activations