INDEX
Explanations
the name "Han" followed by any other token, with different activation levels based on specific contexts
mentions of the name "Han."
Negative Logits
Token         Logit
Thumbnails    -0.68
destro        -0.65
ODUCT         -0.65
cipline       -0.64
tsky          -0.63
Colossus      -0.63
IMAGES        -0.62
utics         -0.62
llan          -0.61
Downloadha    -0.61
Positive Logits
Token         Logit
auer          1.12
ning          1.04
nington       0.99
lon           0.99
wei           0.98
uman          0.98
Solo          0.95
wal           0.88
ako           0.86
ergy          0.86
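
The positive-logit tokens listed above can be read against the explanations: several of them look like continuations of "Han". The sketch below is only an illustration of that reading, not part of the dashboard. It assumes simple concatenation of the prefix and the token; any leading whitespace the original tokens carried was lost in this listing, so a token such as "Solo" may really follow "Han" as a separate word. The names `POSITIVE_LOGITS`, `PREFIX`, and `completions` are hypothetical.

```python
# Top positive-logit tokens and values, copied from the list above.
POSITIVE_LOGITS = [
    ("auer", 1.12),
    ("ning", 1.04),
    ("nington", 0.99),
    ("lon", 0.99),
    ("wei", 0.98),
    ("uman", 0.98),
    ("Solo", 0.95),
    ("wal", 0.88),
    ("ako", 0.86),
    ("ergy", 0.86),
]

# The name cited in the explanations (assumed to be the preceding token).
PREFIX = "Han"


def completions(prefix, rows):
    """Join the prefix with each token, keeping the logit value alongside."""
    return [(prefix + token, value) for token, value in rows]


if __name__ == "__main__":
    for word, value in completions(PREFIX, POSITIVE_LOGITS):
        print(f"{word:<12} {value:+.2f}")
```

Reading the joined strings by eye is a quick sanity check on whether the promoted tokens fit the "Han followed by another token" explanation; it does not measure anything about the feature itself.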
Activations Density 0.027%