INDEX
Explanations
phrases indicating the beginning of an action or process
Negative Logits
entirety    -0.73
obi         -0.71
omb         -0.68
icol        -0.64
athed       -0.62
ingly       -0.61
mens        -0.61
ighth       -0.61
illard      -0.59
airy        -0.59
Positive Logits
anew            1.02
ŃĶ              0.82
behaving        0.79
raining         0.74
dating          0.74
nings           0.73
experimenting   0.73
hemor           0.72
noticing        0.72
researching     0.71
Activations Density 0.068%