Negative Logits
ires     0.84
the      0.74
t        0.73
ées      0.70
ING      0.67
m        0.65
inary    0.63
illing   0.62
ron      0.61
acion    0.61
Positive Logits
kindness  1.17
Kindness  1.05
ва        1.02
ות        1.02
ко        0.99
ма        0.98
ان        0.93
ться      0.91
ם         0.91
ات        0.89
Activations Density 0.003%