Explanations
expressions of ability or capability
Negative Logits
honors      -0.18
coloring    -0.17
Favorite    -0.17
Favorite    -0.15
modeled     -0.15
umor        -0.15
Flavor      -0.15
armored     -0.15
theater     -0.15
signaled    -0.15
Positive Logits
Liked       0.15
Democr      0.15
edImage     0.15
organisers  0.15
image       0.14
.imag       0.14
erator      0.14
awah        0.14
-image      0.14
719         0.14
Activations Density 0.051%