    Explanations

    phrases indicating trust or belief

    Negative Logits
    <bos>      -1.71
    (blank)    -0.90
    (blank)    -0.86
    <?         -0.84
    /***       -0.79
    <?         -0.79
    Fuckin     -0.75
    FTFY       -0.74
    “…”        -0.72
    /*         -0.71
    Positive Logits
     Minang    0.93
     bandung   0.89
     thuy      0.82
    baya       0.79
     marea     0.79
     Désolé    0.77
     jaya      0.76
     ados      0.75
     bayern    0.75
     embodi    0.74
    Activation Density: 0.258%

    No Known Activations