Explanations
references to general concepts or items that are significant or noteworthy
Negative Logits
AnchorStyles  -0.96
pleaſure      -0.89
myſelf        -0.86
juſ           -0.85
ſtate         -0.83
Jefus         -0.83
ſtre          -0.83
uſe           -0.82
viſ           -0.80
ſta           -0.80
Positive Logits
thing         1.37
things        1.36
Thing         1.34
THING         1.30
Things        1.28
Things        1.25
THINGS        1.24
Thing         1.21
things        1.10
THING         0.96
Activations Density 0.080%