INDEX
Explanations
prepositions followed by abstract concepts or actions
prepositions and words indicating relationships or specifics about a topic
Negative Logits
quartered    -0.81
olitical     -0.81
quet         -0.79
ynthesis     -0.76
ires         -0.73
acy          -0.69
ashington    -0.68
ired         -0.67
ensable      -0.67
eteenth      -0.67
Positive Logits
somet         0.85
this          0.83
myself        0.83
figuring      0.82
yours         0.81
ya            0.81
[no token shown]  0.81
yourselves    0.80
THAT          0.78
it            0.77
Activations Density 0.543%