INDEX
Explanations
phrases expressing approval or positive evaluations
Negative Logits
dit           -0.18
yonel         -0.18
yms           -0.17
ein           -0.17
elli          -0.17
elect         -0.17
yen           -0.16
ModelError    -0.16
eum           -0.16
yll           -0.16
Positive Logits
-known        0.33
spring        0.31
ington        0.29
-being        0.27
come          0.26
ows           0.25
enough        0.24
-rounded      0.23
llll          0.23
known         0.23
Activations Density 0.067%
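
For readers unfamiliar with the density figure, the sketch below shows one common way such a number can be computed. It assumes that "activation density" means the fraction of token positions on which the feature's activation is above zero; the page itself does not state the definition. The function name `activation_density` and the sample activations are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: density = (number of active positions) / (total positions).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example (not real data): 2 active positions out of 3000.
sample = [0.0] * 2998 + [0.5, 1.1]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low would indicate a sparse feature that fires on only a small share of tokens.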