Publication/Paper (NAR 2012)


Paper for the NAR annual Database Issue

Note: NAR 2012 is proposed to be a 'wiki special', so I thought it would be a good idea to write up the work done on SEQwiki, and its contents, for that issue. For details, see the instructions for the NAR annual Database Issue (all the submitted wikis have to meet those criteria). --Dan 01:49, 1 August 2011 (PDT)
Note: See the NAR general guidelines for authors.
Note: For discussion, use the 'discussion' tab. See also the forum thread.

Title

SEQanswers Wiki: A Database of Tools for the Analysis of High-Throughput Sequencing Data

Abstract

Recent advances in sequencing technology have created unprecedented opportunities for biological research. The rapid increase in data output of those new technologies has also created remarkable challenges in data management and analysis. Consequently, more and more software is created by bioinformaticians, but this development of packages and algorithms is beginning to outpace the time frame of peer-reviewed publications and other traditional forms of information sharing. In some cases, the algorithms or even the methodology used by a package may change after its publication, making the publication irrelevant for the actual user.

The SEQanswers forum (http://SEQanswers.com) was founded to enable a more dynamic and direct means of communication among the packages' users than the traditionally slow and indirect interactions through peer-reviewed publications; it facilitates rapid dissemination of both wet-lab techniques and information regarding computational tools and analyses. The forum allows new tools, techniques and pipelines to be rapidly announced, tested and benchmarked within the active community.

The SEQanswers wiki is a Semantic MediaWiki (SMW) site that is edited and updated by the members of the SEQanswers community. The wiki provides an extensive catalog of manually categorized analysis tools and technologies, together with information about service providers. The wiki pages provide structured (semantic) data for each tool, including data types and formats, capabilities, and provenance details, as well as links to publications and online resources. All of the data is contributed by users in both structured and free-text form using SMW's semantic data entry capabilities. A search tool provides a simple yet powerful means of retrieving information about tools; structured data can be queried and presented as reports directly within the wiki.

Within two years, the SEQanswers community has created pages for over 400 unique software tools, around 350 references and 500 web links. This collaborative effort has made the SEQanswers wiki one of the most comprehensive and detailed catalogs of high-throughput sequencing tools anywhere on the web.

Introduction

Science is entering an era of international collaboration. Recent breakthroughs in sequencing technologies have brought together wet-lab biologists and bioinformaticians. Traditional biologists who wish to do quick bioinformatics analyses, and junior bioinformaticians early in their careers, are often overwhelmed by the variety of published packages. Choices without clear distinguishing attributes are mentally exhausting rather than productivity enhancing (http://www.physorg.com/news127404469.html , please cite the appropriate psychology paper). Meanwhile, users without access to certain subscription journals would be deprived of the chance to understand a package if no community review were available.

Complexity of the NGS field calls for a community effort

Alongside the ever-expanding next generation sequencing (NGS) technologies has come a sharp rise in the variety of informatics tools. On average, (counts) tools have appeared each year. (divide the total number of tools /beginning of NGS) [anyone have idea?, compare this surge to earlier years]. Such rapid emergence of tools exceeds the capacity of any individual, or even a single institution, to monitor. In addition, funding issues have limited institutions' ability to support non-core activities. For example, the Sequence Read Archive of NCBI, with no comparable alternative so far, will be phased out in the next 12 months. Individual researchers find it even more difficult to keep track of trends in the NGS field. Traditionally, journal clubs are held to educate participants about current topics through sharing and debate. However, in such a setting the attendees are confined to one geographical location, usually a single institution. Moreover, the limited number of topics that can be covered per session impedes knowledge transfer in the NGS field. All of this calls for a robust system for rapid and reliable knowledge sharing.

SEQanswers and SEQanswers wiki: A credible NGS community

Since its establishment in late 2007, SEQanswers has been cited more than 35 [please update it] times in numerous high-impact journals, including Nature and PLoS. SEQanswers aims to be an information resource and user-driven community focused on every aspect of next-generation genomics. SEQanswers is open to everyone regardless of scientific background or knowledge. The SEQanswers wiki originated from a discussion thread in SEQanswers, in which users were presented with a list of packages organized by type of data analysis. Experience showed that a forum works best for open discussion, but not for collaborative editing of resources. A wiki was therefore set up to fulfill this requirement. The SEQanswers wiki is a structured archive of bioinformatics tools for NGS analysis, and serves as a community-annotated central portal to the available tools. By itself, the SEQanswers wiki does not judge the usefulness of the tools listed. However, the popularity of a tool is currently reflected by the view count of its page. Simply put, the more often a tool's page is reached through search or through the tag cloud, the higher its ranking. In the future, we may implement a more reliable metric of popularity, possibly the number of citations in peer-reviewed journals normalized by the time elapsed since the tool's publication.
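
One possible form of such a time-normalized metric is sketched below in Python. It is an illustration only and is not implemented on the wiki: the function name `citations_per_year`, and the choice of dividing the citation count by the number of years since publication, are assumptions made for this sketch.

```python
from datetime import date


def citations_per_year(citations: int, published: date, today: date) -> float:
    """Citation count divided by the years elapsed since publication.

    Hypothetical metric: the paper only suggests "citations normalized by
    the time after the tool's publication" without fixing a formula.
    """
    years = (today - published).days / 365.25
    if years <= 0:
        raise ValueError("publication date must be earlier than 'today'")
    return citations / years


# Example with made-up numbers: 100 citations in the two years after publication.
rate = citations_per_year(100, date(2009, 1, 1), date(2011, 1, 1))
print(f"{rate:.1f} citations per year")
```

Normalizing by age keeps a recently published tool from being ranked below an older one merely because it has had less time to accumulate citations.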

Wiki and community driven motion

Community-driven databases and websites are not rare in the life sciences. Prominent examples include OpenWetWare, a community-curated laboratory protocol site widely used in iGEM; BioStar, a question and answer site for bioinformatics, computational genomics and systems biology; and the BioTechniques forum, where people discuss more traditional laboratory techniques and troubleshooting.
The use of wikis in the life sciences was pioneered by WikiGenes. WikiGenes is a gene-centric wiki system that links traditional expert-reviewed information, for example from NCBI Entrez and UniProt, to new findings from published scientific articles. Each insight about a gene is credited by a reference to the original research article. While WikiGenes users benefit from this central portal to nearly all the information available on a gene, the articles' authors reach a much larger audience. The SEQanswers wiki employs a similar concept: it centralizes, but does not direct, the content of the site. Every registered SEQanswers member (registration is free) is eligible to create and modify content in the database in an intuitive way.

Goal of SEQanswers wiki

The goals of SEQanswers wiki are to:

(i) Gather and organize the ever-growing number of bioinformatics packages through a community effort.

(ii) Provide a freely accessible, criteria-based search interface to facilitate the selection of packages for analysis.

(iii) Accelerate informatics-based knowledge exchange by bridging peer-reviewed journals and the online community.

The community: Number and type of contributions

The SEQanswers wiki is open to editing by an active community of more than 19 thousand members around the world. Registered members can submit and curate profiles of bioinformatics packages. Profiles of both open-source and commercial packages can be submitted with a short description, the application domain, the analysis method employed and the compatible NGS technologies. Notably, a package can be tagged to indicate whether it is under active maintenance. The reference and abstract for a tool can be fetched automatically from its PubMed ID. Currently, the database holds 400 unique software tools, 350 references and 500 web links.

Walkthrough of SEQanswers wiki

(i) Software Hub
Each package is tagged by the language it is programmed in, the operating system it runs on, whether it is still maintained, the NGS technologies it is designed to work with and the type(s) of analysis it performs. Users can narrow down the choices by searching with multiple parameters. Furthermore, a package can carry multiple tags in each category (e.g. an RNA-Seq program can perform both read alignment and junction finding), so users can find a package as long as any one of its attributes in that category matches. Through tagging, bioinformaticians working on real data can quickly retrieve a collection of up-to-date tools for analysis, while tool writers can find the most comparable tools against which to benchmark their own programs. The database helps to organize this semantic information and lets users and developers concentrate on productive analysis and development rather than on finding the right tool for the job. The Software Hub uses tag clouds extensively. A tag cloud is a visual representation of the relative importance of each keyword among all keywords, and allows one to see trends in the field at a glance.
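
The matching rule described above can be sketched in a few lines of Python. This is an assumed model of the behaviour, not the wiki's actual Semantic MediaWiki query: the package names, the category keys and the function `find_packages` are hypothetical. A package is returned when, for every category the user searches on, the requested value is among the package's tags in that category.

```python
from typing import Dict, List, Set

# Hypothetical catalog: package name -> category -> set of tags.
PACKAGES: Dict[str, Dict[str, Set[str]]] = {
    "AlignerX": {
        "language": {"C++"},
        "os": {"Linux", "Mac OS X"},
        "method": {"read alignment", "junction finding"},
        "technology": {"Illumina"},
    },
    "CallerY": {
        "language": {"Java"},
        "os": {"Linux", "Windows"},
        "method": {"SNP calling"},
        "technology": {"Illumina", "454"},
    },
    "AssemblerZ": {
        "language": {"C++", "Perl"},
        "os": {"Linux"},
        "method": {"de novo assembly"},
        "technology": {"454"},
    },
}


def find_packages(packages: Dict[str, Dict[str, Set[str]]], **criteria: str) -> List[str]:
    """Return the names of packages that satisfy every search criterion.

    A criterion is satisfied when the requested value is one of the
    package's tags in that category, so multi-tagged packages still match.
    """
    hits = [
        name
        for name, tags in packages.items()
        if all(value in tags.get(category, set()) for category, value in criteria.items())
    ]
    return sorted(hits)


# A package tagged with two methods is found through either one.
print(find_packages(PACKAGES, method="junction finding"))
# Adding parameters narrows the result.
print(find_packages(PACKAGES, language="C++", technology="454"))
```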

Overview of the Software Hub. A tag's size increases as more packages are tagged with it. In this example, packages written in C++ dominate, followed by Java and Perl. Most of the packages run on Linux. Packages whose maintenance status is still to be confirmed outnumber those confirmed as maintained. Most packages are compatible with Illumina and 454 technologies.
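
As an illustration of how tag sizes can follow tag frequency, the Python sketch below counts how many packages carry each tag and maps the counts linearly onto a font-size range. The linear scaling, the point sizes and the function name `tag_sizes` are assumptions for this example; the wiki's tag cloud extension may scale differently.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_sizes(tag_lists: Iterable[Iterable[str]], min_pt: int = 10, max_pt: int = 30) -> Dict[str, int]:
    """Map each tag to a font size that grows with the number of packages using it."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for tag, n in counts.items():
        # When every tag is equally frequent, show them all at the maximum size.
        fraction = (n - lowest) / span if span else 1.0
        sizes[tag] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


# Made-up language tags for six packages.
languages = [["C++"], ["C++"], ["C++"], ["Java"], ["Java"], ["Perl"]]
print(tag_sizes(languages))
```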

Comparisons and reviews of tools
The SEQanswers wiki maintains independent, community-based, tightly focused reviews of commonly used bioinformatics tools. These reviews introduce users to the essential knowledge needed for bioinformatics analysis. This section is complementary to the search function.

(ii) NGS Providers
This section is a compilation of NGS providers around the globe. Users can find a service provider by the type of service they need (sequencing, genotyping or analysis). The search can be further narrowed down to service providers located in a specific region of the world. Finding a nearby service provider is of great importance to NGS users, because it ensures quick sample delivery and helps maintain sample integrity. This section is invaluable for researchers without an NGS core facility in their own institution. It is also useful for those curious about the current deployment of NGS services in different geographical locations.

Overview of the Next Gen Sequencing Service Provider section, showing the types of service offered by each provider. By default, names are sorted alphabetically; users can customize the sorting. Optionally, names can be filtered by service, region, area or state. A new service provider can be added (top left), and existing entries can be edited.
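
The filtering and sorting behaviour of the provider list can be modelled roughly as follows. This Python sketch is only an approximation under stated assumptions: the `Provider` record, the sample entries and the function `list_providers` are hypothetical and do not reflect how the wiki stores or queries its data.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Provider:
    name: str
    services: FrozenSet[str]
    region: str


def list_providers(
    providers: Iterable[Provider],
    service: Optional[str] = None,
    region: Optional[str] = None,
    sort_key: Optional[Callable[[Provider], object]] = None,
) -> List[Provider]:
    """Filter providers by service and/or region, sorted by name unless a key is given."""
    selected = [
        p
        for p in providers
        if (service is None or service in p.services)
        and (region is None or p.region == region)
    ]
    return sorted(selected, key=sort_key or (lambda p: p.name.lower()))


# Made-up entries for illustration.
PROVIDERS = [
    Provider("Zeta Genomics", frozenset({"sequencing", "analysis"}), "Europe"),
    Provider("Alpha Core", frozenset({"sequencing"}), "North America"),
    Provider("beta Labs", frozenset({"genotyping", "analysis"}), "Europe"),
]

for provider in list_providers(PROVIDERS, service="analysis", region="Europe"):
    print(provider.name)
```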

Strengths and weaknesses of a community system for biology

Like Wikipedia, the SEQanswers wiki is open to editing, but editing is not anonymous. Each modification is associated with a registered user and can be reverted if faulty information is spotted. We note that the wikification of public databases such as GenBank has faced resistance. Although databases like GenBank are important archives of rather static sequences and annotations, researchers have found errors in them (any citation here?). Indeed, to allow prompt error correction while maintaining content accuracy, Steven Salzberg suggested adding a wiki layer to existing curated databases. The SEQanswers wiki, on the other hand, is a semantic wiki that serves as a rapid, day-to-day reference for bioinformaticians. In this setting, rapid communication outweighs accuracy in the description of each tool. In addition, the database by design refers users to the publication for each tool. Community knowledge about many tools can also easily be found in SEQanswers. The SEQanswers wiki is a rigorously community-reviewed platform that serves many bioinformaticians every day.

Future Direction

The SEQanswers wiki has been a successful platform, providing users with powerful search capabilities over packages. Before SEQanswers, user comments were usually directed solely to the authors of the respective packages, and reviews of packages by bloggers were posted independently. SEQanswers has fostered both pre-publication and post-publication review of packages. Long before peer-reviewed publication, packages have often been announced in SEQanswers and tested extensively within the community (think DESeq, for example). Post-publication improvement and benchmarking among developers is encouraged by discussions in SEQanswers (think Cufflinks vs DESeq vs DEGSeq vs ...). The SEQanswers wiki aims to become an independent, community-based review system that complements the peer-reviewed publication system and provides a central portal to the NGS field.

Please read the sites below!

1) http://www.genomesunzipped.org/2011/07/why-publish-science-in-peer-reviewed-journals.php

2) http://scienceofblogging.com/post-publication-peer-review-blogs-vs-letters-to-the-editor/

3) http://cenblog.org/terra-sigillata/2010/12/08/post-publication-peer-review-in-public-poison-or-progress/

Funding

SEQanswers has advertising relationships with commercial companies. All relationships between SEQanswers and sponsoring companies are explicitly listed in About SEQanswers. These companies have no role in the maintenance of SEQanswers, nor in the writing of this manuscript. Contributions to discussions on SEQanswers are made by individuals, each registered with their own account.

References

Acknowledgements

Description in accordance with BioDBcore standards

See some examples.

Database name: SEQanswers wiki
Main resource URL: http://SEQanswers.com/wiki/
Contact information: webmaster@seqanswers.com
Date resource established (year): 2007
Conditions of use (Free, or type of license): Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Scope (data types captured):
Curation policy: manual curation
Standards (MIs, data formats, terminologies):
Taxonomic coverage:
Data accessibility/output options:
Data release frequency: immediately after modification
Versioning period and access to historical files: every modification is versioned; full version history is available
Documentation available:
User support options: forum, e-mail
Data submission policy: any registered user may contribute; registration is not restricted
Relevant publications:
Resource's Wikipedia URL: http://en.wikipedia.org/wiki/SEQanswers
Tools available: