INDEX
Explanations
proper nouns and titles
references to academic professionals and their affiliations
Negative Logits
proves        -0.55
fuck          -0.54
tumblr        -0.53
proved        -0.51
Prelude       -0.50
hinges        -0.50
abiding       -0.50
Whilst        -0.48
assassinate   -0.48
',            -0.47
Positive Logits
]."           0.62
.).           0.59
>.            0.58
].            0.57
veland        0.54
gui           0.52
).            0.51
.�            0.51
]).           0.51
spokeswoman   0.50
Activations Density 0.742%