INDEX
Explanations
phrases related to causation or consequences
instances of actions or events that lead to consequences
Negative Logits
arily       -0.72
toured      -0.71
contrace    -0.69
cared       -0.66
headed      -0.65
topped      -0.64
handled     -0.64
relied      -0.63
BN          -0.63
owed        -0.63
Positive Logits
confirmation    0.79
confusion       0.75
bloodshed       0.73
extinction      0.71
dismissal       0.70
breakthrough    0.69
laughter        0.69
death           0.69
icial           0.68
forth           0.67
Activations Density 0.051%