INDEX
Explanations
phrases related to consequences of actions or delays
Negative Logits
atari             -0.71
anon              -0.71
Sheep             -0.69
=-=-=-=-=-=-=-=-  -0.63
ille              -0.63
arest             -0.62
Seymour           -0.61
ivas              -0.60
arro              -0.60
aroo              -0.60
Positive Logits
diligence  1.15
giving     0.91
lling      0.78
itations   0.73
dilig      0.72
cancell    0.70
solely     0.69
brance     0.69
itiz       0.66
iments     0.64
Activations Density 1.086%