Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families.

Authors:
Reed CJ; Denise R; Hourihan J; Babor J; Jaroch M and 3 more

Journal:
Microb Genom

Publication Year: 2024

DOI:
10.1099/mgen.0.001183

PMCID:
PMC10926702

PMID:
38323604

Journal Information

Full Title: Microbial Genomics

Abbreviation: Microb Genom

Country: Unknown

Publisher: Unknown

Language: N/A

Publication Details

Subject Category: Microbiology

Available in Europe PMC: Yes

Available in PMC: Yes

PDF Available: No

Transparency Score
4/6
66.7% Transparent
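The percentage above appears to follow directly from the indicator count. A minimal sketch of that arithmetic, assuming the score is simply the share of core indicators met, rounded to one decimal place (the function name `transparency_percent` is hypothetical and is not part of rtransparent):

```python
def transparency_percent(met: int, total: int) -> float:
    """Share of indicators met, as a percentage rounded to one decimal.

    Assumption: every indicator carries equal weight.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of indicators")
    if not 0 <= met <= total:
        raise ValueError("met must lie between 0 and total")
    return round(100 * met / total, 1)


# The record above reports 4 of 6 indicators met.
print(f"{transparency_percent(4, 6)}% Transparent")
```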
Transparency Indicators
Core Indicators
Evidence found in paper:

"the process of retrieval is described in detail in the text; the resulting accumulation of published keywords identifiers and accessions is provided (data s1).;
the totality of these resources can be reviewed in the provided supplemental materials (data s2).;
without a sequence known keywords/aliases must be used to acquire sequences from a general protein knowledge database (e g uniprotkb [ ] ncbi [ ] jgi-img [ ] bv-brc [ ]) (e g duf34 protein family data s1) that can then be used to query family databases (fig s3b).;
table 1 duf34 homolog search terms and search engines selected for investigating the quantitative impacts of user choice for each upon literature queries duf34 homolog (keywords/search phrases) literature search tools qualitative search category engine/resource nif3 ngg1-interacting factor 3 nif3l1 gtp cyclohydrolase 1 type 2 duf34 ybgi pf01784 cog0372 yqfo cog3323 specialized scientific literature search pubmed europe pmc pubtator scinapse io base broad scientific literature database search scienceresearch org worldwidescience org broad scientific literature search microsoft academic google scholar to investigate the influence of keyword and search engine choice on the occurrence of false-positive paper yields a more thorough examination of a subset of specialized search tools was examined (i e pubmed pubtator scinapse io and europe pmc) ( fig 3 ; data s3).;
using default search settings a total of 47 unique publications were retrieved with paperblast of which only three were determined to be false positives (6 4 %; fig 4 ; data s4).;
false positives were still present among the results for both paperblast retrieval methods ( fig 4 ; data s5 and s6).;
h sapiens and e coli duf34 homolog sequence results were found to share nine and seven unique publications with the hmm-based results respectively ( fig 4 ; data s5 and s6).;
because of the differences observed in paperblast outputs when using the e coli or h sapiens outputs we repeated paperblast queries using seven diverse sequences from the duf34 family reflecting different superkingdoms and alternative domain architectures ( fig 4 ; data s7-s9).;
the results of four different query result sets henceforth referred to as four distinct 'methods' of retrieval ({1} hmm-derived {2} pubmed text-derived {3} qcc/curated-derived and {4} three separate sequence-derived sets: (a) one using only e coli duf34 homolog sequence; (b) a second that was a merged pair derived from two sequences (duf34 homologs of e coli h sapiens ); and (c) a third constituted by merging seven sequence-derived query result sets (duf34 homologs of e coli h sapiens b cereus methanocaldococcus jannaschii methanococcus maripaludis saccharomyces cerevisiae schizosaccharomyces pombe sequences chosen to represent the putative diversity of the family) were compared to determine overlaps and method-unique yields ( fig 6a ; table s2; data s9).;
moreover the sequence-based searches retrieved seven publications that were not captured by the hmm-based results the two single-sequence-based results (method {4}b) the one single-sequence-based result (method {4}a) or the idealized 'qcc cycle' method ( fig 6 ; data s10)."

Code Sharing
Evidence found in paper:

"Conflicts of interest The author(s) declare that there are no conflicts of interest."

Evidence found in paper:

"Funding information Grant GM70641 to V dC-L."

Protocol Registration
Open Access
Paper is freely available to read
Additional Indicators
Replication
Novelty Statement
Assessment Info

Tool: rtransparent

OST Version: N/A

Last Updated: Aug 05, 2025