INDEX
Explanations
references to leaked information and confidential documents
references to leaked information or documents
Negative Logits
oran            -0.81
rians           -0.78
vil             -0.77
alach           -0.74
==              -0.70
igel            -0.69
============    -0.67
ellen           -0.66
phasis          -0.66
aple            -0.65
Positive Logits
leaked          0.92
leaks           0.87
confidential    0.84
Leaks           0.84
leaking         0.79
leak            0.77
documents       0.74
sheets          0.74
snippets        0.74
closet          0.73
Activations Density 0.014%