INDEX
Explanations
instances of the word 'Truth'
occurrences of an end-of-text token
Negative Logits
ounter     -0.87
ITIES      -0.76
ATIONAL    -0.74
astical    -0.73
chem       -0.69
evid       -0.69
──         -0.67
iners      -0.67
ATED       -0.66
AMES       -0.65
Positive Logits
ful        1.04
Force      0.96
Works      0.94
bilt       0.90
Control    0.88
Machine    0.87
Girl       0.85
bringer    0.85
Matters    0.85
Sisters    0.85
Activations Density 0.125%
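
The density figure above is usually read as the fraction of token positions on which the feature is active. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical, and the synthetic activations are chosen only so that the result matches 0.125%. It is not the dashboard's own computation.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: "active" means strictly greater than the threshold
    (0.0 by default); the dashboard may define it differently.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example (not real data): 8000 token positions,
# with the feature firing on 10 of them.
sample = [0.0] * 7990 + [1.5] * 10
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low means the feature fires on roughly one token in 800, which fits a feature tied to a single word or a special token.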