INDEX
Explanations
events or actions that lead to significant consequences or outcomes
Negative Logits
mercy       -0.63
entric      -0.61
symmetry    -0.60
aves        -0.60
avorite     -0.59
arest       -0.58
loopholes   -0.58
Vaughn      -0.57
irlf        -0.56
afort       -0.56
Positive Logits
better      0.95
gers        0.94
ership      0.81
hunt        0.77
uez         0.75
nowhere     0.74
wig         0.74
ges         0.74
-+          0.74
bare        0.72
Activations Density 0.386%