Beyond BLAST: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families.
Journal Information
Full Title: Microbial Genomics
Abbreviation: Microb Genom
Country: Unknown
Publisher: Unknown
Language: N/A
Transparency Indicators
Core Indicators
"The process of retrieval is described in detail in the text; the resulting accumulation of published keywords, identifiers and accessions is provided (Data S1)."
"The totality of these resources can be reviewed in the provided supplemental materials (Data S2)."
"Without a sequence, known keywords/aliases must be used to acquire sequences from a general protein knowledge database (e.g. UniProtKB, NCBI, JGI-IMG, BV-BRC) (e.g. DUF34 protein family, Data S1) that can then be used to query family databases (Fig. S3b)."
"Table 1. DUF34 homolog search terms and search engines selected for investigating the quantitative impacts of user choice for each upon literature queries.
DUF34 homolog (keywords/search phrases): NIF3; NGG1-interacting factor 3; NIF3L1; GTP cyclohydrolase 1 type 2; DUF34; YbgI; PF01784; COG0372; YqfO; COG3323.
Literature search tools (qualitative search category: engine/resource):
Specialized scientific literature search: PubMed; Europe PMC; PubTator; Scinapse.io; BASE.
Broad scientific literature database search: ScienceResearch.org; WorldWideScience.org.
Broad scientific literature search: Microsoft Academic; Google Scholar.
To investigate the influence of keyword and search engine choice on the occurrence of false-positive paper yields, a more thorough examination of a subset of specialized search tools was examined (i.e. PubMed, PubTator, Scinapse.io and Europe PMC) (Fig. 3; Data S3)."
"Using default search settings, a total of 47 unique publications were retrieved with PaperBLAST, of which only three were determined to be false positives (6.4 %; Fig. 4; Data S4)."
"False positives were still present among the results for both PaperBLAST retrieval methods (Fig. 4; Data S5 and S6)."
"H. sapiens and E. coli DUF34 homolog sequence results were found to share nine and seven unique publications with the HMM-based results, respectively (Fig. 4; Data S5 and S6)."
"Because of the differences observed in PaperBLAST outputs when using the E. coli or H. sapiens outputs, we repeated PaperBLAST queries using seven diverse sequences from the DUF34 family, reflecting different superkingdoms and alternative domain architectures (Fig. 4; Data S7-S9)."
"The results of four different query result sets, henceforth referred to as four distinct 'methods' of retrieval ({1} HMM-derived; {2} PubMed text-derived; {3} QCC/curated-derived; and {4} three separate sequence-derived sets: (a) one using only E. coli DUF34 homolog sequence; (b) a second that was a merged pair derived from two sequences (DUF34 homologs of E. coli, H. sapiens); and (c) a third constituted by merging seven sequence-derived query result sets (DUF34 homologs of E. coli, H. sapiens, B. cereus, Methanocaldococcus jannaschii, Methanococcus maripaludis, Saccharomyces cerevisiae, Schizosaccharomyces pombe; sequences chosen to represent the putative diversity of the family)), were compared to determine overlaps and method-unique yields (Fig. 6a; Table S2; Data S9)."
"Moreover, the sequence-based searches retrieved seven publications that were not captured by the HMM-based results, the two single-sequence-based results (method {4}b), the one single-sequence-based result (method {4}a) or the idealized 'QCC cycle' method (Fig. 6; Data S10)."
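The comparison described in the excerpts above (overlaps and method-unique yields across retrieval methods, and a false-positive percentage for a result set) can be illustrated with plain set arithmetic. The following is a minimal sketch only, not the authors' pipeline: the function names, the method labels and the publication identifiers are all hypothetical placeholders, and the only figure taken from the text is the 3-of-47 false-positive count.

```python
from typing import Dict, Set


def method_unique(results: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each retrieval method, return the publications no other method found."""
    unique: Dict[str, Set[str]] = {}
    for name, pubs in results.items():
        # Union of everything retrieved by the other methods.
        others = set().union(*(p for n, p in results.items() if n != name))
        unique[name] = pubs - others
    return unique


def shared_by_all(results: Dict[str, Set[str]]) -> Set[str]:
    """Publications retrieved by every method."""
    return set.intersection(*results.values())


def false_positive_percent(n_false: int, n_total: int) -> float:
    """Share of retrieved publications judged to be false positives, in percent."""
    return 100.0 * n_false / n_total


# Hypothetical placeholder identifiers; these are NOT data from the paper.
results = {
    "hmm": {"pub1", "pub2", "pub3"},
    "pubmed_text": {"pub2", "pub4"},
    "qcc_curated": {"pub2", "pub3", "pub5"},
    "sequence": {"pub2", "pub6", "pub7"},
}

if __name__ == "__main__":
    print(method_unique(results))
    print(shared_by_all(results))
    # Counts quoted in the excerpt above: 3 false positives among 47 publications.
    print(round(false_positive_percent(3, 47), 1))
```

Representing each method's yield as a set of publication identifiers keeps the overlap and method-unique questions down to intersections and differences, which is one straightforward way to tabulate comparisons of this kind.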
"Conflicts of interest: The author(s) declare that there are no conflicts of interest."
"Funding information: Grant GM70641 to V dC-L."
Additional Indicators
Assessment Info
Tool: rtransparent
OST Version: N/A
Last Updated: Aug 05, 2025