INDEX
Explanations
references to choice and decision-making
Negative Logits
crackdown       -0.60
TestingModule   -0.57
aria            -0.53
paramref        -0.52
Genu            -0.52
popularity      -0.51
forbade         -0.51
essi            -0.51
early           -0.51
pity            -0.51
Positive Logits
interag         0.92
interactions    0.91
interaction     0.84
interact        0.84
interacting     0.82
interacts       0.80
Interactions    0.79
Interactions    0.76
billions        0.73
shaped          0.72
Activations Density: 0.481%
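
The page does not say how the density figure is computed. One common reading, assumed here, is the fraction of sampled tokens on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the toy values are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of activation values strictly above `threshold`.

    Assumption: density = (tokens where the feature fires) / (tokens sampled).
    """
    values = list(activations)
    if not values:
        raise ValueError("need at least one activation value")
    # Count the tokens on which the feature is considered active.
    active = sum(1 for v in values if v > threshold)
    return active / len(values)


# Toy data (made up for illustration): 8 tokens, the feature fires on 2.
toy_activations = [0.0, 0.0, 2.3, 0.0, 0.0, 0.0, 0.7, 0.0]
print(f"{activation_density(toy_activations):.3%}")
```

Under this reading, a density of 0.481% would mean the feature is active on roughly 5 of every 1,000 sampled tokens.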