INDEX
Explanations
mentions or quotes involving personal actions or statements
Negative Logits
stead        -0.65
apo          -0.63
gradation    -0.63
Detected     -0.61
belt         -0.61
ablishment   -0.61
sites        -0.59
force        -0.59
compan       -0.59
chars        -0.59
Positive Logits
themselves   0.76
herself      0.72
onite        0.69
remorse      0.69
goodbye      0.68
hello        0.68
himself      0.65
edly         0.65
angrily      0.65
aloud        0.63
Activations Density: 0.659%