INDEX
NEGATIVE LOGITS
broken          0.36
hanya           0.34
prowess         0.33
abstraction     0.33
loophole        0.33
mode            0.33
forced          0.32
com             0.32
kwa             0.32
mode            0.32
POSITIVE LOGITS
Understand      0.52
Water           0.51
Understanding   0.50
Martine         0.49
Understand      0.49
You             0.49
Aqu             0.49
Your            0.49
Hiking          0.48
<unused1167>    0.47
Activations Density 4.917%