Explanations
causal relationships or explanations in the text
terms related to causation and causal relationships
Negative Logits
Leopard     -0.82
ardless     -0.68
HCR         -0.66
Unicorn     -0.66
Gro         -0.65
>>>>>>>>    -0.65
Flake       -0.65
Pip         -0.63
ushes       -0.63
chip        -0.62
Positive Logits
ality       1.13
caus        0.96
istically   0.91
ally        0.89
ities       0.86
atorial     0.85
allo        0.84
uristic     0.79
inference   0.77
ually       0.76
Activations Density: 0.018%