INDEX
Explanations
references to specific concepts or practices within different belief systems or religions
Negative Logits
*:          -0.70
!.          -0.65
!:          -0.64
+.          -0.63
';          -0.59
.:          -0.59
:,          -0.58
although    -0.57
jri         -0.53
*.          -0.53
Positive Logits
pires       0.79
pired       0.72
differed    0.48
mattered    0.48
ihadi       0.47
might       0.46
FF          0.45
entails     0.45
Script      0.44
EVs         0.44
Activations Density: 0.725%