INDEX
Explanations
references to bears
Negative Logits
lectic     -0.78
icut       -0.78
anwhile    -0.75
ij士       -0.73
uries      -0.70
selves     -0.70
arters     -0.70
Gutenberg  -0.69
yrim       -0.69
ubuntu     -0.69
Positive Logits
cub        0.94
bear       0.90
hugs       0.85
claws      0.84
Gry        0.84
hug        0.83
beit       0.83
xual       0.82
Grizz      0.82
paws       0.80
Activations Density: 0.026%