INDEX
Explanations
numerical information such as ages or quantities
references to specific ages, demographics, or notable individuals in discussions
Negative Logits
acknow        -0.64
tyr           -0.60
"))           -0.57
happ          -0.57
aughed        -0.56
Defin         -0.52
SourceFile    -0.52
estern        -0.52
doesnt        -0.49
equality      -0.49
Positive Logits
,             1.00
,.            0.96
_.            0.91
.             0.89
.,            0.87
*,            0.85
!,            0.84
*.            0.83
.[            0.78
%.            0.76
Activations Density: 0.723%