Explanations
instances of URLs and mentions of specific characters or their actions
Negative Logits
Token       Logit
Fucking     -0.22
fucking     -0.21
fucked      -0.19
fuck        -0.19
shit        -0.18
fuck        -0.18
fucks       -0.16
bullshit    -0.16
_MACRO      -0.16
FUCK        -0.15
Positive Logits
Token       Logit
Dil         0.44
dil         0.35
Dog         0.26
Dog         0.24
dilation    0.23
Rat         0.21
diluted     0.21
dog         0.20
dog         0.19
Boss        0.18
Activations Density: 0.004%