INDEX
    Explanations

    phrases indicating pathways or methods toward achieving something

    Head Attr Weights
    0:0.01
    1:0.02
    2:0.07
    3:0.06
    4:0.25
    5:0.01
    6:0.03
    7:0.35
    8:0.01
    9:0.03
    10:0.05
    11:0.06
    Negative Logits
    " describ": -1.78
    " DEFENSE": -1.58
    "owe": -1.55
    "commit": -1.54
    "quartered": -1.52
    "IRED": -1.51
    "burse": -1.51
    "hillary": -1.48
    " waive": -1.48
    "enough": -1.44
    Positive Logits
    " mush": 1.69
    " Brill": 1.47
    " Reality": 1.45
    " tyranny": 1.44
    " quot": 1.43
    " mell": 1.41
    " hordes": 1.38
    "rama": 1.36
    " reality": 1.36
    " 4090": 1.32
    Act Density 0.001%

    No Known Activations