INDEX
Explanations
references to dishonor and the consequences of betrayal
Negative Logits
gave     -0.33
threw    -0.31
drew     -0.30
wrote    -0.29
grew     -0.28
blew     -0.28
saw      -0.28
took     -0.28
broke    -0.26
took     -0.25
Positive Logits
taken    0.40
gone     0.40
seen     0.38
spoken   0.37
gotten   0.36
flown    0.36
Seen     0.35
Taken    0.35
eaten    0.34
idden    0.33
Activations Density: 0.117%