We are looking to take on interns for the summer with the possibility of further full-time employment.

All our projects start with a biological problem. You will work closely with us, receiving regular feedback and individual mentoring as you create a user-friendly computational solution to that problem. We expect the work you do during your internship to have a good chance of making it into production and becoming available to all our users.

Below we present several project proposals. However, Genestack is evolving rapidly and is flexible enough to accommodate new project ideas, so please reach out to us if you’d like to discuss alternative or upcoming projects, or to propose a project idea of your own.

Project 1: Single-cell RNA-seq analysis and interpretation

Aim: to explore and implement improvements to Genestack’s single-cell RNA-Seq analysis and interpretation pipeline.

Genestack has been developing an analysis and visualisation pipeline for single-cell RNA-Seq experiments. This includes methods to assess intercellular heterogeneity, a challenging and crucial step in single-cell RNA-Seq analysis, where technical variation can be very high: we have implemented methods to identify heterogeneously expressed genes with or without spike-in data. We have also integrated visualisation and clustering methods to identify cell subpopulations and assign cells to them based on the similarity of their gene expression profiles. Several dimensionality reduction techniques are available (PCA, t-SNE), coupled with automatic cluster assignment (k-means clustering).
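As a rough illustration of the dimensionality reduction plus clustering idea described above, here is a minimal sketch using NumPy only. It is not Genestack's implementation: the function names, the simulated expression matrix, and the deterministic farthest-point initialisation (a simplification chosen for reproducibility) are all assumptions made for this example.

```python
import numpy as np

def pca(X, n_components=2):
    """Project the rows of X (cells x genes) onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # centre each gene
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's k-means with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        # distance of every point to its nearest already-chosen centre
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Simulated data (hypothetical): two subpopulations of 50 cells, 20 genes each.
rng = np.random.default_rng(42)
pop_a = rng.normal(0.0, 1.0, size=(50, 20))
pop_b = rng.normal(0.0, 1.0, size=(50, 20))
pop_b[:, :5] += 6.0                              # five marker genes raised in B
expr = np.vstack([pop_a, pop_b])

embedding = pca(expr, n_components=2)
labels, centers = kmeans(embedding, k=2)
print(np.bincount(labels))                       # cluster sizes
```

On real single-cell data the matrix would first be normalised and restricted to heterogeneously expressed genes; at the scale of millions of cells an exact SVD like this one would likely need to be replaced by an approximate or incremental method.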

Single-cell gene expression profiling is gaining wider adoption and is becoming more scalable than ever, producing single-cell datasets with up to millions of samples. This creates both new computational challenges and new bioinformatics opportunities, and in this project we will look into both. To address the computational challenges, we’re looking to:

  • Integrate with faster expression quantification pipelines on Genestack
  • Make the visualisation application scalable to handle millions of cells

In the area of bioinformatics, there are several possible directions:

  • Identify better features for subpopulation classification and analysis.
  • Look for ways to deconvolute the cell cycle (e.g. using a latent variable model).

In this project, you will have the opportunity to deal with large-scale biological datasets and build powerful visualisation tools to mine them, as well as explore bioinformatics methods at the forefront of genomics and transcriptomics research.

Project 2: Genetic variant analysis and interpretation

Aim: to explore and implement improvements to Genestack’s variant analysis and interpretation pipeline.

Genestack has a well-developed pipeline for WES/WGS analysis, from preprocessing and quality control, to variant calling, annotation, and association analysis integrated with external databases such as dbSNP and the 1000 Genomes Project. We’ve also built an application that lets you browse millions of variants seamlessly with real-time querying and sorting capability.

We’d now like to enhance this by adding more exploratory analysis capabilities and support for rare variant association testing. For exploratory analysis, we’re looking into clustering the individuals as well as clustering the SNPs associated with disease (to see which variants are inherited together and whether they tell us anything about pathways).

Our current association analysis is based only on single genetic variants, but this method is underpowered for testing rare variants, which can play key roles in influencing complex traits and diseases. To address this limitation, we’re looking to integrate SKAT (the sequence kernel association test), which allows SNP-set level testing (e.g. over a gene or a region) for association between a set of variants and dichotomous or quantitative phenotypes.
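To make the SNP-set idea concrete, here is a minimal sketch of a SKAT-style variance-component statistic for a quantitative phenotype. Several things here are assumptions made for illustration: the function names, the simulated genotypes, the intercept-only null model (no covariates), and the fixed Beta(1, 25) weighting of variants by minor allele frequency. The published method derives p-values analytically; a permutation p-value is swapped in here for brevity, so this should not be read as the SKAT package's implementation.

```python
import numpy as np

def skat_q(G, y):
    """SKAT-style statistic Q = sum_j w_j * (g_j . (y - mean(y)))**2.

    G: individuals x variants matrix of minor-allele counts (0, 1, 2).
    y: quantitative phenotype, one value per individual.
    """
    maf = G.mean(axis=0) / 2.0
    sqrt_w = 25.0 * (1.0 - maf) ** 24       # Beta(1, 25) density: up-weights rare variants
    resid = y - y.mean()                    # residuals under an intercept-only null model
    score = G.T @ resid                     # one score per variant
    return float(np.sum((sqrt_w * score) ** 2))

def permutation_pvalue(G, y, n_perm=999, seed=0):
    """Empirical p-value obtained by permuting the phenotype."""
    rng = np.random.default_rng(seed)
    q_obs = skat_q(G, y)
    q_null = np.array([skat_q(G, rng.permutation(y)) for _ in range(n_perm)])
    return q_obs, float((1 + np.sum(q_null >= q_obs)) / (n_perm + 1))

# Simulated data (hypothetical): 200 individuals, 10 rare variants in one gene.
rng = np.random.default_rng(1)
G = rng.binomial(2, 0.02, size=(200, 10))
y = rng.normal(size=200) + 1.5 * G[:, :3].sum(axis=1)   # three causal variants

q_obs, p_value = permutation_pvalue(G, y)
print(f"Q = {q_obs:.1f}, permutation p = {p_value:.3f}")
```

Pooling the per-variant scores into a single statistic is what gives the set-level test its power when each individual variant is carried by only a handful of people.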

In this project, you’ll have the opportunity to analyse and explore hundreds of real patient WES datasets covering multiple diseases and various phenotypes, and to contribute to discovering results of clinical significance.

Project 3: Machine learning applications in bioinformatics

Aim: to use standard and single-cell RNA-seq expression data to predict tissues and cell types.

Genestack has an extensive collection of private and public genomics and transcriptomics datasets. This includes microarray and RNA-seq experiments from public repositories, together with well-annotated metadata. We have been processing them and accumulating an ever larger amount of gene expression data. We are now interested in leveraging this data to learn the expression signatures of specific phenotypes: for instance, training a predictive model that assigns a tissue on the basis of a sample’s expression profile. This approach can then be extended to predict other phenotypic attributes such as disease, cell line, or cell type. We are also interested in applying machine learning approaches to RNA-seq datasets at single-cell resolution, for cell subpopulation discovery.
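As one possible baseline for the tissue prediction task described above, here is a minimal nearest-centroid classifier sketch in NumPy. The choice of model, the function names, the tissue labels, and the simulated expression profiles are all assumptions made for this example; a real project would benchmark several strategies on curated public datasets.

```python
import numpy as np

def fit_centroids(X, labels):
    """Return the sorted class names and the mean expression profile of each class."""
    classes = np.array(sorted(set(labels)))
    centroids = np.array([X[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict_tissue(X, classes, centroids):
    """Assign each sample to the class whose centroid is nearest (Euclidean distance)."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dist.argmin(axis=1)]

# Simulated data (hypothetical): 3 tissues, 40 samples each, 30 genes,
# with 5 signature genes raised in each tissue.
rng = np.random.default_rng(0)
tissues = ["brain", "heart", "liver"]
n_genes, n_per_tissue = 30, 40
parts, y = [], []
for i, tissue in enumerate(tissues):
    profile = np.zeros(n_genes)
    profile[5 * i : 5 * i + 5] = 4.0
    parts.append(profile + rng.normal(size=(n_per_tissue, n_genes)))
    y += [tissue] * n_per_tissue
X, y = np.vstack(parts), np.array(y)

# Hold out a quarter of the samples for validation.
order = rng.permutation(len(y))
train, test = order[:90], order[90:]
classes, centroids = fit_centroids(X[train], y[train])
predicted = predict_tissue(X[test], classes, centroids)
accuracy = float(np.mean(predicted == y[test]))
print(f"held-out accuracy: {accuracy:.2f}")
```

A simple, interpretable baseline like this gives a reference point against which more complex models can be validated, and its centroids can be read directly as candidate expression signatures.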

In this project, you will help with researching appropriate machine learning strategies, designing and implementing them, as well as performing benchmarking/validation analysis.

Project 4: Genomics-based Crop Analysis Using Multi-omics Gene-Trait Networks

Aims:
(1) To integrate multi-omics data and literature mining for constructing plant knowledge networks.
(2) To develop query/visualisation features for interrogating the knowledge network and finding candidate genes associated with phenotypes.

Genestack is working on an Innovate UK-funded collaborative project with Rothamsted Research, a major UK agri-genomics research centre and the longest running agricultural research station in the world. Over the past ten years, Rothamsted Research has been developing techniques to integrate multi-omics data and literature mining for gene network analysis. This will enable scientists to use high-throughput bioinformatics technologies to accelerate genomics-based crop improvement and protection.

These tools have now been integrated into Genestack: users have access to a simple, streamlined process that runs from data collection through knowledge network building to knowledge discovery. Furthermore, these tools have been integrated with other Genestack applications to aid the knowledge discovery process. For example, using the homology inference application, users can now link networks between different species via protein homology relationships, allowing a novel organism to be analysed immediately using a well-annotated organism.
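The cross-species linking described above can be pictured as a path search over a graph of typed relationships. The following is a minimal sketch using only the Python standard library; every identifier, relationship type, and edge in it is hypothetical and invented for illustration, and it does not reflect how the knowledge network is actually stored or queried.

```python
from collections import deque

# Hypothetical edges: (source, relationship, target)
edges = [
    ("wheat:geneX", "homologous_to", "arabidopsis:geneA"),
    ("arabidopsis:geneA", "encodes", "arabidopsis:proteinA"),
    ("arabidopsis:proteinA", "mentioned_with", "trait:drought tolerance"),
    ("arabidopsis:geneB", "associated_with", "trait:seed size"),
]

def build_graph(edges):
    """Build an undirected adjacency list that keeps the relationship labels."""
    graph = {}
    for source, relation, target in edges:
        graph.setdefault(source, []).append((relation, target))
        graph.setdefault(target, []).append((relation, source))
    return graph

def find_path(graph, start, goal):
    """Breadth-first search: return the shortest node path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _relation, neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

graph = build_graph(edges)
path = find_path(graph, "wheat:geneX", "trait:drought tolerance")
print(" -> ".join(path))
```

In this toy network the poorly annotated wheat gene reaches a trait only through the homology edge to its well-annotated counterpart, which is the kind of evidence chain a candidate-gene query would surface.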

In this project, you will assist our effort with the construction and exploration of the knowledge network in several directions:

  • Identifying and integrating new -omics datasets to enhance the generated knowledge network
  • Automated summarisation: generating a text summary of a list of genes/proteins based on the information in the network. Such a summary is useful for inferring the function of a gene/protein
  • Improving upon the current network-based visual analytics, e.g. overlaying expression data on the network

You will work with researchers at Rothamsted and with selected industry participants, who will help steer the project by contributing specific requirements and use cases.

All projects:

Requirements:

–Scripting proficiency with Python and/or R, including data analysis libraries (e.g. Bioconductor, pandas, NumPy)
–Proficiency with the UNIX command-line
–Good knowledge of statistics
–A passion for finding rigorous ways to investigate ill-defined biological questions
–Ability to report scientific findings

Ideally:

–Experience in bioinformatics data analysis: knowledge of data types, pipelines, databases, and typical challenges/caveats associated with bioinformatics analysis