Explanations
Dates written as month followed by day, with especially high activation on January 9th.
Negative Logits
Token        Logit
helle        -0.83
atics        -0.79
atic         -0.76
APH          -0.73
atis         -0.71
ribution     -0.71
hematic      -0.69
onse         -0.68
anced        -0.68
ropolitan    -0.67
Positive Logits
Token        Logit
nect         1.05
st           1.03
esville      1.03
vier         0.98
itors        0.88
itor         0.85
iotic        0.81
eker         0.78
ality        0.75
FK           0.74
Activations Density: 1.292%
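
A density figure like this is usually read as the fraction of sampled tokens on which the feature fires. The page does not define the term, so the following is only a minimal sketch under that assumption. The function name `activation_density`, the zero threshold, and the sample values are hypothetical and not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires; the source page does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count the tokens on which the feature is active.
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 8 tokens, with the feature firing on 2 of them.
sample = [0.0, 0.0, 3.2, 0.0, 0.0, 0.0, 1.1, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 1.292% would mean the feature is active on roughly 13 of every 1,000 sampled tokens.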