INDEX
NEGATIVE LOGITS
nesses    -0.83
ness      -0.82
nl        -0.77
staff     -0.71
NESS      -0.68
ENTION    -0.65
Beir      -0.64
sit       -0.62
ulates    -0.62
olicy     -0.61
POSITIVE LOGITS
cki       1.24
xit       1.20
ppo       1.07
uca       1.01
ño        0.92
ttes      0.91
esi       0.91
ea        0.91
lla       0.83
zzi       0.83
Activations Density 0.050%