INDEX
    Explanations

    words related to safety and suggestions for better practices

    Negative Logits
     for
    -0.31
     для
    -0.23
     untuk
    -0.22
    	for
    -0.21
     για
    -0.21
     pentru
    -0.20
     für
    -0.19
    为
    -0.18
     voor
    -0.18
    for
    -0.18
    Positive Logits
     purposes
    0.81
     sake
    0.79
     reasons
    0.42
     purpose
    0.41
     PURPOSE
    0.34
    purpose
    0.32
     reason
    0.31
    pur
    0.30
    Purpose
    0.30
    来说
    0.29
    Activation Density 0.674%

    No Known Activations