INDEX
Explanations
the word "mask" with high activations, and related words like "disguise" with lower activations
references to masks and disguises
Negative Logits
Token                       Logit
athan                       -0.74
âĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢ    -0.72
scill                       -0.72
course                      -0.70
GGGGGGGG                    -0.68
Yards                       -0.66
lished                      -0.66
ALLY                        -0.66
ally                        -0.66
rian                        -0.65
Positive Logits
Token                       Logit
masks                       1.12
mask                        1.06
resses                      0.91
Mask                        0.88
mask                        0.83
wearer                      0.80
Mask                        0.80
worn                        0.80
concealed                   0.79
ħĭ                          0.77
Activations Density: 0.017%