    Explanations

    dataset names and types

    Negative Logits
    Token             Value
    " bottoms"        0.44
    " sendMessage"    0.43
    " deletes"        0.43
    " lashes"         0.39
    " Tiktok"         0.39
    " yesterday"      0.39
    " doo"            0.38
    " Sash"           0.38
    " summon"         0.38
    " erections"      0.38
    Positive Logits
    Token             Value
    "数据集"          0.72    (Chinese: "dataset")
    " dataset"        0.69
    " datasets"       0.69
    "Dataset"         0.67
    " benchmark"      0.64
    " Dataset"        0.62
    "Benchmark"       0.62
    "benchmark"       0.61
    "开源"            0.61    (Chinese: "open source")
    "Datasets"        0.61
    Activation Density: 0.137%

    No Known Activations