Architecture


PIPA uses a client (Flash) / server (Apache + Mysql + Python) architecture. Below is a brief description of the architectural and data concepts.

Server architecture


The main server (PIPA server) hosts the Mysql database and the experimental data. It provides connectivity to the clients over an HTTPS connection. Additional workers can be configured to connect to the main server over secured Mysql and Network File System (NFS) connections.

Figure 1: PIPA architecture

PIPA repository files with repository root folder at /home/pipa/pipa_base:

  • pipa: server files (Python scripts)
    • analysis: baySeq, DEseq, ChIP-seq (MACS) analysis interfaces
    • config: server configuration files
    • data: methods for data manipulation (resolve multiplexing, sync with JBrowse and security files with .htaccess config options)
    • db: main database interface routines with Sqlalchemy
    • genes: methods to update the genes Mysql table from GFF files in the genomes folder
    • genomes: scripts and methods to download/parse/prepare various genomes
    • jbrowse: methods to add/edit JBrowse track instances
    • map: methods to handle experimental data mapping to reference sequences
    • path: methods for path construction for various purposes
    • pool: pooling is used to notify the client about changes on the server
    • profiles: gene expression profiles (e.g. time-points)
    • quality: data quality control routines
    • results: computes gene expression (results)
    • security: takes care of group membership/access
    • tickets: ticket routines
    • user: user management
    • utils: various one-function utilities
    • web: interfaces to Mysql tables (client communication over JSON)
  • pipa_client: client files (Adobe Flash Builder)
    • src: Actionscript source code of client web application
Data is stored separately (potentially in a different location or disk image; see Server configuration) with the following structure:

  • analysis: analysis results
    • analysis_1: results of analysis with id A1
    • ...
  • data: experimental raw data with quality control results
    • data_1: experimental data with id D1
    • ...
  • map: mapping results
    • data_1_map_1: mapping results with id D1_M1
      • results_1: results (gene expression) with id D1_M1_R1
      • ...
    • ...
  • genomes: assembly and annotation of available genomes
    • assembly: genome assemblies by folder (genome_id)
      • hg19: assembly of human genome, together with bowtie indexes
        • mapability_50: hg19 mapability for reads of length 50
        • ...
      • ...
    • annotation: genome annotations by folder (genome_id)
      • hg19: annotation of human genome (GFF3 and GTF), these are in sync with the genes Mysql table for fast searching
      • ...
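
The folder layout above can be expressed as a few path-building helpers. This is a minimal sketch, not PIPA's own path module: the function names and the data root /data/pipa are hypothetical, and only the folder naming (data_1, data_1_map_1, results_1, analysis_1, mapability_50) is taken from the structure above.

```python
import posixpath

# Hypothetical data root; the real location is set in the server configuration.
DATA_ROOT = "/data/pipa"


def data_folder(data_id, root=DATA_ROOT):
    """Raw experimental data with quality control results, e.g. data/data_1."""
    return posixpath.join(root, "data", "data_%d" % data_id)


def map_folder(data_id, map_id, root=DATA_ROOT):
    """Mapping results, e.g. map/data_1_map_1."""
    return posixpath.join(root, "map", "data_%d_map_%d" % (data_id, map_id))


def results_folder(data_id, map_id, result_id, root=DATA_ROOT):
    """Gene expression results, nested inside the mapping folder."""
    return posixpath.join(map_folder(data_id, map_id, root), "results_%d" % result_id)


def analysis_folder(analysis_id, root=DATA_ROOT):
    """Analysis results, e.g. analysis/analysis_1."""
    return posixpath.join(root, "analysis", "analysis_%d" % analysis_id)


def mapability_folder(genome_id, read_length, root=DATA_ROOT):
    """Mapability data for one genome assembly and read length."""
    return posixpath.join(root, "genomes", "assembly", genome_id,
                          "mapability_%d" % read_length)
```

For example, results_folder(1, 1, 1) yields the folder for the result with id D1_M1_R1 under the assumed root.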

Data concepts


The experimental data (FASTQ files) are stored under unique ids beginning with D (D1, D2, etc). Each experiment is mapped (aligned) to a reference sequence. Mapping ids start with M and are numbered from 1 within each experiment (D1_M1, D1_M2, ... D10_M1, D10_M2, etc). Different types of gene expression results can be produced from each mapping. Result ids start with R (D1_M1_R1, D1_M1_R2, etc). Additionally, various analyses (differential expression, ChIP) can be run on mapping results or other types of data. Each analysis is assigned a unique id starting with A (A1, A2, etc).

Figure 2: PIPA data concepts
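
The id scheme can be made concrete with a small parser. The sketch below is illustrative only: the names PipaId and parse_id are hypothetical and not part of PIPA. It splits ids such as D1, D10_M2 and D1_M1_R1 into their numeric parts and handles analysis ids (A1, A2, ...) separately.

```python
import re
from collections import namedtuple

# Hypothetical container for the parts of an id; unused parts are None.
PipaId = namedtuple("PipaId", ["kind", "data", "mapping", "result", "analysis"])

_DATA_RE = re.compile(r"^D(\d+)(?:_M(\d+)(?:_R(\d+))?)?$")
_ANALYSIS_RE = re.compile(r"^A(\d+)$")


def parse_id(identifier):
    """Split an id like 'D1_M1_R1' or 'A2' into its numeric components."""
    match = _ANALYSIS_RE.match(identifier)
    if match:
        return PipaId("analysis", None, None, None, int(match.group(1)))

    match = _DATA_RE.match(identifier)
    if match is None:
        raise ValueError("not a recognised id: %r" % identifier)

    data, mapping, result = [int(g) if g is not None else None
                             for g in match.groups()]
    if result is not None:
        kind = "result"
    elif mapping is not None:
        kind = "mapping"
    else:
        kind = "data"
    return PipaId(kind, data, mapping, result, None)
```

An id that skips a level, such as D1_R1, is rejected with a ValueError, because a result always belongs to a mapping.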

Installation


This guide assumes you are installing PIPA on an Ubuntu system (Apache, Mysql and Python 2.7) with username pipa (/home/pipa) and default Apache DocumentRoot (/var/www).

Quick start


Clone the PIPA Bitbucket repository to /home/pipa/pipa_base.

Install all server/bioinformatics dependencies to $HOME/software by running setup.sh.

Edit /etc/apache2/sites-enabled/000-default and add:

WSGIDaemonProcess ServerName processes=1 threads=100 python-path=/home/pipa/pipa_base/pipa user=pipa group=pipa display-name=pipa_wsgi
WSGIProcessGroup ServerName
WSGIScriptAlias /pipa_api/ /home/pipa/pipa_base/pipa/web/
(replace ServerName with your server hostname or IP address)

Create database pipa on Mysql server and import the tables/views:
mysqladmin create pipa -u<username> -p<password>
mysql pipa -u<username> -p<password> < pipa.sql
(replace <username> and <password> with your local Mysql username and password)
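
With the WSGIScriptAlias directive from the Apache step above, mod_wsgi serves the script files in the pipa/web folder and calls a callable named application in each of them. The following is a minimal sketch of such a script returning a JSON object. It is not one of PIPA's actual interfaces, and the response fields are made up for illustration.

```python
import json


def application(environ, start_response):
    """Minimal WSGI callable that answers every request with a JSON object."""
    # Illustrative payload only; the real PIPA interfaces return table data.
    payload = {"status": "ok", "path": environ.get("PATH_INFO", "")}
    body = json.dumps(payload).encode("utf-8")

    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    # WSGI expects an iterable of byte strings.
    return [body]
```

Saved under a hypothetical name such as ping.wsgi in the web folder, a script like this can serve as a quick check that Apache and mod_wsgi are wired up before the database is involved.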

Server dependencies


The following is a list of dependencies with hints on how to install them on Ubuntu (where packages are available). When the instructions say "download and install", install the software to a location in your system's PATH.

Modwsgi
Client/server communication with JSON objects.
apt-get install libapache2-mod-wsgi

MysqlDB
apt-get install python-mysqldb

Sqlalchemy
apt-get install python-sqlalchemy

R
apt-get install r-base-core

Numpy
apt-get install python-numpy

Scipy
apt-get install python-scipy

Bioinformatics dependencies


JBrowse
Genome browser used for visualization of alignment results. Install to /var/www/jbrowse.

Bowtie v0.12.9
Short-read aligner, download and install (compile).

Bowtie v2.0.5
Short-read aligner, download and install (compile).

Samtools
Install latest version.

baySeq
Install by running R and entering:
source("http://bioconductor.org/biocLite.R")
biocLite("baySeq")

bedGraphToBigWig
Browse to the folder for your operating system (e.g. for 64-bit Ubuntu, open the folder linux.x86_64) and download bedGraphToBigWig.

bedtools
Download and install.
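
Since several of these tools are installed by hand, it can help to confirm that they ended up on the PATH. The sketch below searches the PATH for each executable. The executable names are the usual ones for these tools and are assumed here rather than taken from PIPA, and the helper names are hypothetical.

```python
import os

# Assumed executable names for the command-line dependencies listed above.
TOOLS = ("bowtie", "bowtie2", "samtools", "bedGraphToBigWig", "bedtools", "R")


def find_executable(name, search_path=None):
    """Return the full path of an executable found on the PATH, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for folder in search_path.split(os.pathsep):
        if not folder:
            continue  # skip empty PATH entries
        candidate = os.path.join(folder, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names=TOOLS, search_path=None):
    """Return the names that could not be found on the PATH."""
    return [name for name in names if find_executable(name, search_path) is None]
```

Calling missing_tools() with no arguments checks the current environment and returns an empty list when every tool was found.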

Proceed to server and client configuration.

Features

Documentation of PIPA features.

Data management

PIPA supports experimental data in FASTA/FASTQ format. Here we describe a few basic steps to upload and annotate your experimental data:

  • Upload experimental data (FASTA/FASTQ format) from your local computer
  • Annotate experiments
  • Map (align) reads to reference genome
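
Before uploading, it can be useful to confirm which of the two formats a file is in. FASTA records begin with a ">" header line and FASTQ records with an "@" header line, so the first line is enough for a rough check. The helpers below are a hypothetical sketch, not part of PIPA, and do not validate the rest of the file.

```python
def detect_format(first_line):
    """Guess the sequence format from the first line of a file."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    return None  # neither header marker was found


def detect_file_format(path):
    """Read only the first line of a plain-text file and guess its format."""
    with open(path) as handle:
        return detect_format(handle.readline())
```

Compressed files would need to be opened with the matching decompression module first.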

Bioinformatics

Configuration


Server configuration


Rename the file pipa/config/__init__.sample to pipa/config/__init__.py. Edit this config file and populate all fields (see the comments inside the file for instructions).
Add /home/pipa/pipa_base to the $PYTHONPATH environment variable.
Download biox by cloning the repository to /home/pipa/biox. Follow the installation instructions.
Add /home/pipa/biox to the $PYTHONPATH environment variable.
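
A quick way to confirm the two $PYTHONPATH entries is to inspect the variable from Python. The following is a small sketch with a hypothetical helper name; the two paths are the ones used in this guide.

```python
import os

# Folders this guide asks you to add to $PYTHONPATH.
REQUIRED_PATHS = ("/home/pipa/pipa_base", "/home/pipa/biox")


def missing_from_pythonpath(required=REQUIRED_PATHS, pythonpath=None):
    """Return the required folders that are not listed in PYTHONPATH."""
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    # Normalise entries so that a trailing slash does not cause a false alarm.
    present = set(os.path.normpath(entry)
                  for entry in pythonpath.split(os.pathsep) if entry)
    return [path for path in required if os.path.normpath(path) not in present]
```

An empty list from missing_from_pythonpath() means both folders are present in the current environment.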

Client configuration


Copy precompiled Flash binaries from /home/pipa/pipa_base/pipa_client/bin to /var/www/pipa.
Edit /var/www/pipa/index.html and change line:
var flashvars = {'host':'myhostname', 'protocol':'http', 'wsgi':'pipa_api/'};
by replacing the host value with your server hostname/IP address. If necessary, also change the protocol (http or https) and the wsgi location.
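
The three flashvars presumably combine into the base address of the server API. The sketch below shows that composition in Python for clarity. How the Flash client actually builds its request URLs is an assumption here, and build_api_url is a hypothetical helper.

```python
def build_api_url(host, protocol="http", wsgi="pipa_api/"):
    """Combine protocol, host and wsgi location into one base URL (assumed scheme)."""
    # Strip stray slashes so the parts join with exactly one separator.
    return "%s://%s/%s" % (protocol, host.strip("/"), wsgi.lstrip("/"))
```

With the default values from index.html, build_api_url("myhostname") gives http://myhostname/pipa_api/, which should match the WSGIScriptAlias location configured in Apache.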

Grid configuration



Python scripting guide


The client provides efficient control of data and analysis. However, if you would like to perform further analysis or manipulate the data/results, one option is to use the scripting functionality of PIPA.

If you open a Python shell on the server, your first step would be:

import pipa

Accessing data


To query all experimental data from the Mysql database, you would then use:
import pipa
from pipa.db import Session, Data  # database session factory and mapped class
conn = Session()                   # create session
data = conn.query(Data).all()      # Mysql table data is mapped to class Data
for data_record in data:
    print(data_record.id)          # parenthesized form works on Python 2.7 and 3
    print(data_record.name)

See database structure for details of tables and python classes representing them.

Running analysis


Database structure


The mapping between Mysql tables and Python objects is handled by SQLAlchemy. Here we describe the Mysql tables and their ORM mapping to Python objects.

Mysql tables and Python objects


Mysql | Python | Description
data | Data | Experimental data; each record represents one single-end (one file) or paired-end experiment (2 files) stored in the data folder
mappings | Mappings | Alignment (mappings) of data to reference genomes; each record represents one mapping stored in the map folder
results | Results | Results (e.g. gene expression); each record represents one result (from a specific mapping) stored in the results folder
results_templates | ResultsTemplates | Different types of results (e.g. various gene expression scaling and normalizations); each record represents one result template
genomes | GenomesView | Genomes table, linked to tables genomes_annotation (annotations in GTF/GFF3 are stored in the genomes_annotation folder) and genomes_assembly (FASTA, stored in the genomes_assembly folder)