Explanations
references to behavior and behavioral changes
Negative Logits
risten      -0.69
Milne       -0.68
Laurens     -0.68
Clooney     -0.66
gzip        -0.62
Koning      -0.62
Clarkson    -0.61
deadline    -0.61
tigma       -0.61
amak        -0.61
Positive Logits
behavior    2.89
Behavior    2.68
behaviour   2.67
behavior    2.55
behaviors   2.51
BEHAVIOR    2.49
Behavior    2.39
Behaviour   2.38
behaviour   2.31
behaviours  2.30
Activations Density: 0.091%