INDEX
Explanations
mentions of the word "ub" with varying activation levels
references to "ub" as a recurrent pattern or theme
Negative Logits
Lauder     -0.78
ORIG       -0.72
drift      -0.67
Atlantic   -0.66
alez       -0.65
Irma       -0.63
FUL        -0.62
Burnett    -0.61
backer     -0.61
agher      -0.61
Positive Logits
lishing    1.26
bing       1.21
rious      1.15
lisher     1.12
lique      1.11
bed        1.10
bles       1.10
bish       1.10
lish       1.05
ilant      1.04
Activations Density: 0.034%