Explanations
the word "be" with high activation values
instances of the word "be" in various contexts
Negative Logits
Token           Logit
Harlem          -0.64
ablishment      -0.62
guarantee       -0.60
ision           -0.60
Elise           -0.59
Mund            -0.58
Stain           -0.58
orer            -0.58
shall           -0.57
Must            -0.57
Positive Logits
Token           Logit
forgiven        1.03
able            0.94
mistaken        0.91
acons           0.90
construed       0.87
tempted         0.83
heading         0.79
considered      0.79
regarded        0.78
underestimated  0.77
Activations Density 0.148%
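
For readers who want to reproduce listings of this shape, the following is a minimal sketch rather than the tool's actual method. It assumes the logit lists are simply the k lowest and k highest entries of a per-token logit map, and that "activations density" means the fraction of tokens on which the feature fires (activation > 0). The helper names `top_logits` and `activation_density` and the toy activation list are hypothetical; the six token/logit pairs are copied from the listing above.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def top_logits(logits: Dict[str, float], k: int = 10) -> Tuple[List[Pair], List[Pair]]:
    """Return (k most negative, k most positive) token/logit pairs.

    Hypothetical helper: assumes the dashboard lists are plain top-k
    selections over a token -> logit mapping.
    """
    ranked = sorted(logits.items(), key=lambda kv: kv[1])
    negative = ranked[:k]        # ascending order: most negative first
    positive = ranked[::-1][:k]  # descending order: most positive first
    return negative, positive


def activation_density(activations: List[float]) -> float:
    """Fraction of tokens with a non-zero (positive) activation.

    Assumption: this is what "density" measures; the source does not say.
    """
    if not activations:
        return 0.0
    return sum(1 for a in activations if a > 0) / len(activations)


# Token/logit pairs copied from the listing above.
sample = {
    "forgiven": 1.03, "able": 0.94, "mistaken": 0.91,
    "Harlem": -0.64, "ablishment": -0.62, "guarantee": -0.60,
}
neg, pos = top_logits(sample, k=2)
print(neg)
print(pos)

# Toy activations (hypothetical data): one active token out of ten.
toy_activations = [0.0] * 9 + [2.5]
print(f"{activation_density(toy_activations):.1%}")
```

Under these assumptions, a density of 0.148% would correspond to the feature firing on roughly 1 token in every 676.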