INDEX
Explanations
text related to abstract concepts or arguments
phrases related to understanding and discussion
Negative Logits
shit        -0.63
éĥ          -0.56
raping      -0.53
tumblr      -0.53
Virgin      -0.53
Same        -0.53
ById        -0.50
milo        -0.49
inferior    -0.49
murderer    -0.48
Positive Logits
ascript     0.56
cautiously  0.55
spoiler     0.54
optimistic  0.53
summarize   0.51
conclud     0.49
bookmark    0.49
academic    0.48
optimism    0.48
diplom      0.48
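Positive and negative logit lists like these are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and keeping the most extreme entries. The dashboard does not say how its scores were computed, so the sketch below only assumes that convention. The function `logit_effects` and the toy weights are hypothetical, and real dashboards may normalize or rescale the scores.

```python
from typing import List, Tuple


def logit_effects(
    decoder_dir: List[float],
    w_unembed: List[List[float]],  # assumed shape: [d_model][vocab]
    vocab: List[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Hypothetical sketch: score each vocab token by decoder_dir @ W_U.

    Returns (top_positive, top_negative) as lists of (token, score) pairs.
    """
    d_model = len(decoder_dir)
    # score[j] = sum_i decoder_dir[i] * w_unembed[i][j]
    scores = [
        sum(decoder_dir[i] * w_unembed[i][j] for i in range(d_model))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]         # most negative scores first
    positive = ranked[::-1][:k]   # most positive scores first
    return positive, negative


if __name__ == "__main__":
    # Toy 2-dimensional model with a 4-token vocabulary (made-up numbers).
    vocab = ["optimistic", "shit", "academic", "murderer"]
    decoder_dir = [1.0, -0.5]
    w_unembed = [[0.4, -0.3, 0.2, -0.1], [-0.2, 0.6, 0.0, 0.4]]
    pos, neg = logit_effects(decoder_dir, w_unembed, vocab, k=2)
    print([t for t, _ in pos])  # ['optimistic', 'academic']
    print([t for t, _ in neg])  # ['shit', 'murderer']
```

Ranking the full vocabulary once and slicing both ends keeps the sketch simple; for a real vocabulary of tens of thousands of tokens a partial selection (for example a heap) would be cheaper.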
Activation density: 2.295%
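Activation density is usually read as the share of token positions on which the feature fires across the sampled text, so 2.295% would mean the feature is active on roughly 1 token in 44. The sketch below assumes "fires" means an activation strictly above a threshold of zero. Both the function name `activation_density` and the `threshold` parameter are hypothetical illustrations, not the dashboard's own code.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Hypothetical sketch: percentage of positions with activation > threshold."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


if __name__ == "__main__":
    # Toy activations for 8 token positions; the feature fires on 2 of them.
    print(activation_density([0.0, 1.2, 0.0, 0.0, 3.4, 0.0, 0.0, 0.0]))  # 25.0
```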