INDEX
Explanations
phrases that indicate repetition or familiarity with ideas over time
Negative Logits
anymore     -0.15
obar        -0.14
aris        -0.14
.habbo      -0.14
urent       -0.14
ogn         -0.13
issan       -0.13
wp          -0.13
ilde        -0.13
ÃŃÅ¡         -0.13
Positive Logits
before      0.59
previously  0.56
before      0.48
Before      0.45
Before      0.43
elsewhere   0.42
antes       0.40
Previously  0.39
-before     0.38
Previously  0.38
Activations Density 0.169%