INDEX
Explanations
sentences that present a statement followed by an observation or explanation
Negative Logits
ittees         -0.72
pees           -0.69
displayText    -0.69
vous           -0.68
oise           -0.68
ILA            -0.66
incial         -0.66
ĪĴ             -0.65
ãĥ¯ãĥ³         -0.64
ÑĤ             -0.64
Positive Logits
"[             1.65
"â̦            1.55
"...           1.38
"'             1.26
'[             1.18
:"             1.12
"(             1.10
"              0.97
:[             0.96
""             0.94
Activations Density 0.273%