INDEX
Explanations
subjects performing actions
Negative Logits
上述          0.55
Effective     0.55
effective     0.50
Percent       0.50
Overlap       0.48
pernyataan    0.47
suelen        0.46
必ず          0.46
際に          0.46
Typically     0.46
Positive Logits
began         1.08
went          1.02
became        0.99
knew          0.97
laughed       0.93
took          0.92
awoke         0.85
panicked      0.85
lasted        0.85
hurriedly     0.85
Activations Density 0.106%