INDEX
    Explanations

    room

    Negative Logits
    Token       Logit
    "c"         1.09
    "۔"         0.97
    "'"         0.96
    "ם"         0.95
    "ق"         0.88
    "ف"         0.87
    "га"        0.82
                0.82
    "cích"      0.82
    "the"       0.81
    Positive Logits
    Token       Logit
                1.23
    " rooms"    1.21
    " ROOM"     1.19
    "el"        1.16
    "ला"        1.15
    " room"     1.12
    " Room"     1.05
    "ет"        1.01
    "ed"        0.99
    "ло"        0.97
    Act Density 0.019%

    No Known Activations