I suspect the issue is that Dr. Mark Bailey didn't start at the beginning of the paragraph of the analysis he was quoting. I downloaded a copy of the actual analysis and will quote the complete paragraph below, bolding the first sentence:
**At this point, the contig with the identification k141_27232, with which 1,407,705 sequences are associated, and thus about 5% of the remaining 26,108,482 sequences, should be discussed in detail.** Alignment with the nucleotide database on 05/12/2021 showed a high match (98.85%) with "Homo sapiens RNA, 45S pre-ribosomal N4 (RNA45SN4), ribosomal RNA" (GenBank: NR_146117.1, dated 04/07/2020). This observation contradicts the claim in [1] that ribosomal RNA depletion was performed and human sequence reads were filtered using the human reference genome (human release 32, GRCh38.p13). Of particular note here is the fact that the sequence NR_146117.1 was not published until after the publication of the SRR10971381 sequence library considered here.
Source:
https://brandfolder.com/s/3z266k74ppmnwkvfrxs6jjc
It seems that the mathematician is saying that this 98.85% match was for one specific contig, with the id k141_27232.
Let's examine this in detail and see how disingenuous Dr Bailey is.
Why would Bailey leave off the sentence that shows that only about 5% of the remaining sequences are associated with this 98.85% match? That would mean that, of those remaining sequences, no more than about 5% showed a match to Homo sapiens RNA.
But even that is still misleading. When we look at the paper put on the internet by a mathematician who refused to give his name, we find something else.
After filtering the paired-end reads, 26,108,482 of the original total of 56,565,928 reads remained, with a length of about 150 bp.
The mathematician used less than 50% of the original reads because he restricted himself to only using reads of about 150 bp. That seems a bit odd. Why would you claim you can't replicate something when you use less than 50% of the data?
0.4615 x 0.05 x 0.985 ≈ 0.0227, or about 2.27%

So the reality is that only about 2.27% of the original reads MAY have a match with Homo sapiens RNA. (It's actually quite a bit less, as I will show.) The mathematician also fails to account for the per-nucleotide error rate that he himself insists applies to every read.
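If you want to check the arithmetic yourself, here is a quick back-of-the-envelope sketch in Python. The read counts come straight from the quotes above; the rounded factors in the last line are the ones I used in the calculation, and the variable names are my own.

```python
# Back-of-the-envelope check of the percentages discussed above.
# All counts are taken from the quoted passages; only the arithmetic is mine.

total_reads    = 56_565_928   # reads in the original library
retained_reads = 26_108_482   # reads left after the mathematician's fastp filtering
contig_reads   = 1_407_705    # reads associated with contig k141_27232

retained_fraction = retained_reads / total_reads    # ~0.46, i.e. less than half the data
contig_fraction   = contig_reads / retained_reads   # ~0.054, the "about 5%" in the quote

print(f"reads retained after filtering: {retained_fraction:.2%}")   # ~46.16%
print(f"reads on contig k141_27232:     {contig_fraction:.2%}")     # ~5.39%

# The rough product used above, with the same rounded factors:
print(f"{0.4615 * 0.05 * 0.985:.2%}")   # ~2.27% of the original reads
```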
So let's recap.
Dr Bailey cherry-picks some sentences out of a paper by an anonymous author, and we are expected to believe him.
The anonymous mathematician then filters the data to eliminate over half of it.
Title: Structural analysis of sequence data in virology (An elementary approach using SARS-CoV-2 as an example)
Author: By a mathematician from Hamburg, who would like to remain unknown
By filtering the data the mathematician comes up with significantly fewer contigs.

To prepare the paired-end reads for the actual assembly step with Megahit (v.1.2.9) [20], we used the FASTQ preprocessor fastp (v.0.23.1) [21]. After filtering the paired-end reads, 26,108,482 of the original total of 56,565,928 reads remained, with a length of about 150 bp.
From the paper that did the original sequencing.
Megahit generated a total of 384,096 assembled contigs (size range of 200–30,474 nt), whereas Trinity generated 1,329,960 contigs with a size range of 201–11,760 nt.
https://www.nature.com/articles/s41586-020-2008-3#Sec1
From our anonymous mathematician:
We obtained 28,459 (200 nt - 29,802 nt) contigs, significantly less than described in [1].

When you restrict your data, you would certainly be able to assemble fewer contigs, something our anonymous mathematician seems unconcerned about.
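To put "significantly fewer" in perspective, here is the same kind of quick check, using only the contig counts quoted above; the ratios are my own addition.

```python
# Contig counts quoted above; the ratios are mine.
megahit_contigs_paper = 384_096    # Megahit contigs reported in the original paper
trinity_contigs_paper = 1_329_960  # Trinity contigs reported in the original paper
reanalysis_contigs    = 28_459     # contigs from the anonymous re-analysis

print(f"{reanalysis_contigs / megahit_contigs_paper:.1%} of the paper's Megahit contigs")  # ~7.4%
print(f"{reanalysis_contigs / trinity_contigs_paper:.1%} of the paper's Trinity contigs")  # ~2.1%
```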
But let's go back and look again at that statement that Dr Bailey decided not to include.
At this point, the contig with the identification k141_27232, with which 1,407,705 sequences are associated, and thus about 5% of the remaining 26,108,482 sequences, should be discussed in detail.

Is that right? Is it really only ONE of the 28,459 contigs from our mathematician that has a 98.85% match?
That would mean the figure is not even 1%. Only about 0.0035% of the assembled contigs (1 of 28,459) came up with a possible match to human RNA.
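Again, a one-line sanity check; the count is from the quotes above, the arithmetic is mine.

```python
# One contig out of the 28,459 assembled contigs matched the human rRNA sequence.
print(f"{1 / 28_459:.4%}")   # ~0.0035% of the assembled contigs
```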
What is interesting is whether the other 28,458 contigs came up with a 98% or higher match to the Wuhan sequence. We will never know, because the anonymous mathematician doesn't tell us.
But he does tell us this.
We set all bases with a quality lower than 20 to "N" (unknown). A quality of 20 means an error rate of 1% per nucleotide, which can be considered sufficient in the context of our analyses.

So he allowed a 1% per-nucleotide error rate and then tries to claim that a single contig with more than 6,000 nucleotides should be accepted as proof of something? Not very good math on his part.
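For reference, the Phred quality scale relates a score to a per-base error probability by Q = -10 * log10(p), so Q20 really does correspond to a 1% error rate per nucleotide, exactly as the quoted passage says. The short sketch below shows that, plus a rough worst-case illustration for the ~6,000 nt contig mentioned above; treating every base as if it sat right at the Q20 threshold is my own simplifying assumption, not something the analysis states.

```python
def phred_error_probability(q: float) -> float:
    """Per-base error probability implied by a Phred quality score: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

print(phred_error_probability(20))   # 0.01 -> the 1% per nucleotide in the quote

# Rough worst-case illustration (my assumption): if every base of a ~6,000 nt contig
# carried the full 1% error rate, you would expect on the order of
# 0.01 * 6000 = 60 erroneous bases in that single contig.
print(0.01 * 6_000)                  # 60.0
```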
In conclusion, we have Bailey cherry-picking a statement from a mathematician who didn't use all the data. That is not evidence of anything other than fraud on the part of Bailey.