Explanations
negative outcomes or states
Negative Logits
др  0.43
ญ  0.41
managedbuild  0.40
ProRes  0.39
泺  0.39
utilizza  0.38
যেন  0.38
ദി  0.38
ეგი  0.38
situations  0.37
Positive Logits
pissed  1.02
piss  0.91
bitch  0.73
screwed  0.72
crap  0.70
fucked  0.68
screwing  0.63
fucking  0.63
damn  0.63
shitty  0.63
Activations Density: 0.017%