INDEX
Explanations
describing negative situations
Negative Logits
Token         Logit
amerika       0.42
নিদ্র          0.41
sün           0.40
ட்சத்திர       0.40
indeks        0.40
ैटिन          0.39
खाते          0.38
鮋            0.38
notice        0.38
Paj           0.37
Positive Logits
Token         Logit
farce         0.50
overcrowding  0.46
happening     0.45
unbearable    0.44
ناقابل        0.44
intolerable   0.43
walls         0.43
absurdity     0.43
wretched      0.42
curfew        0.42
Activations Density 0.039%