INDEX
NEGATIVE LOGITS
juſt
-1.07
كلام
-0.94
nk
-0.92
durant
-0.91
aporation
-0.91
joyful
-0.90
grumpy
-0.89
tember
-0.88
לאחר
-0.85
jakość
-0.85
POSITIVE LOGITS
before
0.94
both
0.86
invid
0.84
hinh
0.83
bati
0.82
well
0.79
dirond
0.78
Before
0.78
}^{*}\
0.77
at
0.76
Activations Density 0.005%