Explanations
phrases related to providing explanations or reasons
Negative Logits
atri        -0.74
heit        -0.71
arious      -0.71
sha         -0.70
arel        -0.69
olit        -0.69
iaries      -0.68
Pixel       -0.68
ography     -0.67
isse        -0.67
Positive Logits
discrepancies     1.04
deaths            1.03
instability       1.02
outbreaks         0.95
widespread        0.92
inconsistencies   0.92
variance          0.92
disappearance     0.91
unexplained       0.91
delays            0.90
Activations Density 0.354%