Explanations
references to individual or group identities and their interactions in various contexts
Negative Logits
becomes      -0.22
begins       -0.18
eaten        -0.17
dies         -0.17
начинает     -0.17
discovers    -0.17
emerges      -0.16
start        -0.16
blir         -0.16
learns       -0.16
Positive Logits
recently      0.25
recent        0.24
currently     0.20
started       0.20
began         0.20
established   0.19
runs          0.18
Runs          0.18
runs          0.18
Recently      0.18
Activations Density 0.947%