INDEX
Explanations
references to societal norms and trends
Negative Logits
two          -0.24
entirety     -0.21
three        -0.19
possibility  -0.19
chance       -0.19
entire       -0.18
presence     -0.17
zwei         -0.17
two          -0.17
slightest    -0.16
Positive Logits
early        0.19
stuff        0.17
newer        0.17
etter        0.17
recent       0.17
earlier      0.16
ones         0.16
htar         0.15
things       0.15
later        0.15
Activations Density 0.094%