INDEX
    Explanations

    mentions of experiments involving human-like creatures

    NEGATIVE LOGITS
    " interpol"    -0.16
    "ugins"        -0.15
    " interp"      -0.14
    "afen"         -0.14
    "quel"         -0.14
    "aters"        -0.14
    "dg"           -0.13
    "stell"        -0.13
    "_GC"          -0.13
    "erland"       -0.13
    POSITIVE LOGITS
    " experiments"     0.28
    " experimental"    0.28
    " research"        0.26
    " experiment"      0.25
    " Experimental"    0.25
    " testing"         0.24
    "experimental"     0.23
    "experiment"       0.23
    " tests"           0.22
    "研究"             0.22
    Activation Density: 0.050%

    No Known Activations