Last Friday we had the final presentation of our project – a poster session! By far the nicest course examination I have had.
I have to admit I was very nervous about this though, since you never know what questions will come and of course you know (at the back of your head, at least) that you are being evaluated.
Since we were mingling to learn about the other projects, I didn’t hear everything that was said about our poster, unfortunately. After the session I talked to Eva and Magnus to find out what feedback people gave them so that I could get a complete picture. Here is the feedback we received on our poster:
Using information from the MS 1 spectra to work with more data. (I think this was the other intensity measure that Lukas suggested, saying iBAQ might be a little too crude.)
Setting an R2 threshold lower than R2 ≥ 0.8 for the linear regression study to look at more R2 values.
Using histones for normalization of the leakage.
Comparing the iBAQ values with the LFQ values.
Investigating to which extent the proteases we have identified are active in the media.
As previously discussed: do some clustering!
Actually, MaxQuant outputs three different intensity measures per sample and proteinID:
intensity – the sum of all individual peptide intensities belonging to a particular protein group. Unique and razor peptide intensities are used as default.
iBAQ (Intensity Based Absolute Quantification) – values calculated by MaxQuant are the (raw) intensities divided by the number of theoretical peptides. Thus, iBAQ values are proportional to the molar quantities of the proteins. The iBAQ algorithm can roughly estimate the relative abundance of the proteins within each sample.
LFQ (Label-free quantification) – intensities are based on the (raw) intensities and normalized on multiple levels to make sure that profiles of LFQ intensities across samples accurately reflect the relative amounts of the proteins.
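As a tiny illustration of the iBAQ definition above (the numbers are made up for the example), the value is simply the raw intensity divided by the count of theoretically observable peptides:

```python
def ibaq(raw_intensity, n_theoretical_peptides):
    """iBAQ = raw intensity / number of theoretically observable peptides.

    Dividing by the theoretical peptide count removes the size bias
    (large proteins yield more peptides), so iBAQ is roughly
    proportional to the molar amount of the protein.
    """
    return raw_intensity / n_theoretical_peptides

# Made-up example: a protein with summed intensity 1.2e9 and
# 40 theoretical tryptic peptides.
print(ibaq(1.2e9, 40))  # -> 30000000.0
```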
We chose iBAQ basically since we had gotten the idea that this was the best option, but I don’t remember the actual reasoning behind this, unfortunately.
The reason we chose R2 ≥ 0.8 was that we wanted to narrow things down and find nice linear fits. However, it is true that in biology an R2 ≥ 0.6 is often considered fine. So yes, maybe we missed some interesting and important trends by setting the threshold at R2 ≥ 0.8.
As for the other suggested improvements, we either didn’t think of them or simply couldn’t find the time, unfortunately. We also wanted to finish running the data through the OpenMS (TOPPAS) software with Percolator and TopPerc for comparison, but this didn’t work for us and we were running out of time.
In the end the session went quite well and I really enjoyed it! Not so much the presenting part maybe, but definitely the mingling part, where I could talk to the others to find out more about their projects. It was really nice to hear the others explain their projects with so much enthusiasm, and it is definitely more fun and interesting to learn about the projects this way than from a PowerPoint presentation. I find it more comprehensible this way, since you can ask your questions more easily when they come up, and the face-to-face interaction somehow makes at least my mind much more alert and engaged. I loved that part! Eva and I talked about this on our way home and how time just flew by without you being inactive or bored for even a second. I could have happily stayed even longer to ask the questions I forgot or didn’t have time for. All in all, a very well arranged session! It was so nice that the teachers had arranged snacks and beverages and that KTH paid for these and the posters. I would like to thank the teachers for a great course, my fellow students for their nice presentations and everyone for the good atmosphere! Thanks also to Eva and Magnus for a nice collaboration!
Working with this project was really interesting, but I wish we had had more time. Or at least that we could have worked on it full time, to learn more about the software and do more analyses on our results. It felt a little like we had just gotten started when we reached the final week… Also, next time I do a project like this, I will make even more of an effort to acquire information about the experimental setup. At least for me, it is very important to know more about the background and how the experiment was carried out to fully understand and analyze the results. In any case, we had a good time and the blog part was fun, as well as the poster session!
So, here is the final, full version of our poster. Enjoy!
Though the project is over, I would love to hear back from anyone who has feedback on my blog, the poster and our work. Take care!
OK, so I thought I’d summarize the project. My aim here is to explain things in a rather simple way so that, hopefully, anyone (even my mom) should be able to understand it, without knowing anything about bioinformatics or proteomics.
BACKGROUND & AIM
What is proteomics?
Proteomics is the large-scale study of proteins: analysis of the expression, localization, functions and interactions of the proteins of an organism. (Proteins are the building blocks of life. They are macromolecules composed of chains (peptides) of smaller molecules, so-called amino acids.) The proteome is the complete set of proteins produced by an organism or a system, and it changes over time depending on conditions.
What do we need it for?
We are interested in learning more about the proteome of different cells and organisms to study the structure and function of proteins. This information, and how it varies over time and with different conditions, is essential for
comparing tissues in healthy and diseased states
finding candidate biomarkers for diseases (diagnostics)
designing and developing biopharmaceuticals
Shotgun proteomics – interpretation of generated spectra from all detectable proteins in a sample.
Targeted proteomics – targets specific peptides for quantification and detection of proteins. (The mass spectrometer is programmed to analyze a preselected group of proteins.)
What is bioinformatics?
Bioinformatics is a field of science combining computer science, statistics, mathematics and engineering to analyze and interpret biological data.
This is an advanced course in bioinformatics, specifically concerned with analysis of large-scale data sets from genomic, transcriptomic and proteomic experiments and how these need to be treated differently statistically than smaller data sets. We learned about different methods and tools for handling and analyzing high-throughput molecular biology data and how they can be applied.
What was the name of the project?
“The Proteomics of CHO Cell Supernatants”. The title of the resulting poster was: “Exploring the Proteome of CHO Cell Supernatants“. Read about the poster session in my next entry, the last one.
What was the objective of the project?
The objective of this project was to identify proteins secreted by antibody-producing CHO cells, with the long-term goal of helping future cultivation optimization and reducing spill production, especially due to potential proteolytic enzymes. We were especially interested in studying secretion, leakage of intracellular proteins and proteolytic activity, as well as how the proteome changes over time.
More specifically, our aim was to:
Match our data against the Chinese hamster proteome to infer peptides and proteins on a controlled FDR level
Look for signal peptides on the identified proteins (using e.g. Phobius)
Compare the results from different pipelines and/or protein inference strategies to see if there are any differences in the conclusions we can draw
What are CHO cells? What are they used for and why?
CHO cells are Chinese hamster ovary cells, a cell line often used for the production of biopharmaceuticals. The cells are highly adaptable and easy to work with. More importantly, they are able to provide the necessary modifications* to the newly produced proteins and then fold them properly so that they become biologically active in humans. For example, CHO cells can be used to produce therapeutic antibodies.
* post-translational modifications, PTMs
What is a proteolytic enzyme?
A proteolytic enzyme is an enzyme that breaks down proteins into their component polypeptides or amino acids.
What is a signal peptide and why is it of relevance?
A signal peptide is a short peptide at the N-terminus of the majority of newly synthesized proteins that are secreted through the secretory pathway. If we know which proteins have a signal peptide, we basically know whether they will be secreted or not. Intracellular proteins that don’t have signal peptides would not be secreted and therefore must have leaked out somehow if they are found in the supernatant.
Phobius is a web tool that predicts transmembrane topology and signal peptides, information that could be of importance. In our case, we were mainly interested in knowing which proteins had a signal peptide.
MATERIAL & METHODS
What were the material and methods that were used?
We analyzed raw data from a 45 day long experiment where antibody producing CHO cells were cultivated in a so called perfusion bioreactor. The data was based on supernatant samples from 17 different, unevenly distributed time points (D3-D45). Unfortunately, we had no biological replicates, only 3 technical replicates per time point. An LC-MS/MS analysis had been carried out with a so called QE-HF, which is an orbitrap mass spectrometer. We used the Chinese hamster (Cricetulus griseus) proteome available on UniProt as reference proteome for annotation. The raw data that the LC-MS/MS analysis generated was run in a software called MaxQuant (inferring peptides on an FDR level of 0.001). The results were studied with the help of Microsoft Excel/Google spreadsheet, homemade Python software (thanks to Magnus), UniProt and Phobius.
Mixture of proteins → digestion by the enzyme trypsin → fractionation → LC-MS/MS analysis (HPLC + MS/MS) → MaxQuant: database searching of the collected MS/MS spectra → correlation analysis → identification of peptides/proteins
(We also tried OpenMS (TOPPAS) but, unfortunately, it took too long to install the different parts needed and build a working workflow. When we finally received the last of the data we still hadn’t got OpenMS running well, so we decided to switch to MaxQuant full time.)
We analyzed the data using the following approaches:
Linear regression study (R2 study) – to study how the abundance (intensity, based on iBAQ values) of proteins varied over time, we decided on a linear regression approach to look for the best linear fits. Magnus wrote a Python program that performed the linear regression analysis and output plots for all proteins with R2 ≥ 0.8.
Linear regression (R^2) study: UTP-glucose-1-phosphate uridylyltransferase
Linear regression (R^2) study: proteasome subunit alpha type-7
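To make the filtering idea behind these plots concrete, here is a minimal, self-contained sketch (this is not Magnus’ actual program, and the intensity profiles are made up):

```python
def r_squared(xs, ys):
    """R^2 of a simple linear fit of ys against xs (squared Pearson r)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

days = [3, 6, 10, 17, 24, 31, 38, 45]               # unevenly spaced time points
steady = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.8, 8.0]   # near-linear increase
noisy = [3.0, 1.0, 4.5, 2.0, 5.0, 1.5, 4.0, 2.5]    # no clear trend

for name, profile in [("steady", steady), ("noisy", noisy)]:
    r2 = r_squared(days, profile)
    verdict = "keep" if r2 >= 0.8 else "skip"       # the threshold we used
    print(f"{name}: R^2 = {r2:.2f} -> {verdict}")
```

A protein like the “steady” one passes the R2 ≥ 0.8 cut and would be plotted; the “noisy” one is dropped, which is exactly why we complemented this study with the high-intensity ranking.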
High intensity (iBAQ) study – to compensate for the fact that the linear regression study would miss all proteins that could not be nicely fitted to a linear trend, we decided to rank the 100, and then the 10, most abundant proteins based on iBAQ intensity values.
Overlap of R2 and iBAQ results – to show what proteins were present in both lists (based on iBAQ values).
Statistical study with t-test – all samples from the 17 time points were divided into two halves: the first part of the experiment (time points 1-8) and the second half (time points 10-17), excluding all data from the 9th time point. Only proteins with 5 valid values per experimental half were included. The t-test was carried out by homemade Python software coded by Magnus that output box plots for significant changes.
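A rough sketch of that splitting and the t statistic (not the actual analysis code; the iBAQ values here are invented):

```python
from statistics import mean, variance

def split_halves(per_timepoint):
    """Split 17 per-time-point values around the excluded 9th time point.
    Returns None unless both halves have at least 5 valid values."""
    first = [v for v in per_timepoint[:8] if v is not None]    # time points 1-8
    second = [v for v in per_timepoint[9:] if v is not None]   # time points 10-17
    if len(first) >= 5 and len(second) >= 5:
        return first, second
    return None

def welch_t(a, b):
    """Welch's two-sample t statistic for the change from a to b."""
    va, vb = variance(a), variance(b)
    return (mean(b) - mean(a)) / (va / len(a) + vb / len(b)) ** 0.5

# Invented iBAQ averages for one protein at the 17 time points (None = missing):
values = [1.0, 1.2, 0.9, 1.1, None, 1.3, 1.1, 1.0,   # time points 1-8
          1.6,                                       # time point 9 (dropped)
          2.0, 2.4, 2.1, None, 2.2, 2.5, 2.3, 2.7]   # time points 10-17

halves = split_halves(values)
if halves is not None:
    t = welch_t(*halves)
    print(f"t = {t:.1f}")  # a large positive t suggests a real increase
```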
An intracellular leakage study was conducted with a homemade Python program by Magnus. This was done by looking at how secreted proteins varied in abundance over time. Here, secreted proteins were taken to be the proteins that had been predicted to have a signal peptide. The iBAQ intensities of the predicted secreted proteins for each time point were divided by the total sum of iBAQ values.
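The per-time-point ratio can be sketched like this (protein IDs and numbers invented for the example):

```python
def secreted_fraction(ibaq_by_protein, has_signal_peptide):
    """Fraction of the total iBAQ intensity at one time point that comes
    from proteins predicted to carry a signal peptide."""
    total = sum(ibaq_by_protein.values())
    secreted = sum(v for p, v in ibaq_by_protein.items()
                   if has_signal_peptide.get(p, False))
    return secreted / total

# Invented time point: one predicted-secreted protein, two intracellular ones.
ibaq = {"P1": 30.0, "P2": 50.0, "P3": 20.0}
signal = {"P1": True, "P2": False, "P3": False}
print(secreted_fraction(ibaq, signal))  # -> 0.3
```

Computing this fraction for every time point gives the leakage curve: a low and falling fraction means that most of the supernatant protein mass is not secreted but leaked.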
What is LC-MS/MS? What kind of information does it give?
LC – liquid chromatography is a technique to separate and identify proteins from a mixture.
MS – mass spectrometry is an analytical technology used to
determine molecular weight
identify molecules with known molecular weight
MS can be used to analyze complex protein samples. Basically, it works like this:
the molecules are ionized.
the ions are then separated with respect to mass and charge (mass to charge ratio, m/z).
a detector detects the ions and their m/z and outputs a so-called mass spectrum with intensity on the y-axis and m/z on the x-axis.
Tandem MS (MS/MS) – technique/instrument with two inbuilt mass spectrometers:
MS1 – selects the peptides by mass one by one (precursor ion selection)
MS2 – analyses peptide fragments (the fragment ions of each peptide)
This is used to investigate or confirm a peptide sequence, analyze very complex samples (where peptide masses may overlap even with a high-resolution mass spectrometer) and to study PTMs.
Orbitrap – an ion trap MS instrument. Ion traps act as both MS1 and MS2, trapping, selecting and fragmenting the peptide ions before detecting the product ions.
LC-MS/MS is the combination of LC and tandem MS.
What is MaxQuant really and why was it used? How does it work?
MaxQuant is a software package for quantitative proteomics, specifically aimed at high-resolution MS data, used to analyze large-scale MS data sets. It uses the Andromeda peptide search engine and a framework called Perseus for statistical analysis. The input is the LC-MS/MS data in .raw format and the output is a series of documents with different information. One of these documents, called proteinGroups, contains a lot of data, such as the intensities of the identified protein groups. Read more about MaxQuant here or here.
RESULTS
Linear regression (R^2) study: proteasome subunit alpha type-7
Linear regression (R^2) study: UTP-glucose-1-phosphate uridylyltransferase
High intensity (iBAQ) study – the cellular component gene ontology study of the top 100 most abundant proteins showed that the majority of them were intracellular.
Out of the top 100, we looked at the top 10 most abundant proteins as measured by iBAQ intensity values. As can be seen in the table below, the most abundant ones are histones, proteins involved in nucleosome assembly. Besides the three histones, there are also two proteins involved in protein folding.
What are histones?
Histones are proteins that package and order the DNA into structural units called nucleosomes in eukaryotic cell nuclei. A nucleosome is a unit of DNA packaging in eukaryotes.
Protein topology study – the topology prediction showed that only around 19% of the top 100 iBAQ proteins had a signal peptide, and for the proteins with a nice linear fit only around 6% were predicted to contain a signal peptide for secretion.
Overlap study – we found that there was an overlap of six proteins between the high intensity (iBAQ) study and the linear regression (R2) study. Four out of these were proteasome subunits.
What are proteasomes?
Proteasomes are protein complexes that degrade damaged proteins, or proteins that are not needed. These proteins are tagged with ubiquitin, a small protein, to indicate that the proteins should be degraded.
Statistical study with t-test – this showed that proteins related to cellular structure, such as decorin and collagen, increased significantly. The same went for proteolytic proteins, which also showed an increased abundance from the first part of the experiment to the second. Note that “first 8 days” and “last 8 days” actually refer to the time points, not D1-D8 and D38-D45.
t-test: proteasome subunit
Intracellular leakage study – we found that the abundance of secreted proteins (as predicted by signal peptide) dropped by around 10% after the initial two time points. In any case, the majority of abundant proteins were not secreted proteins, indicating a substantial leakage of intracellular proteins.
Read more about the analysis and how we interpreted the results here.
What improvements could have been made?
Using histones for normalization of the leakage
Using information from the MS 1 spectra to work with more data
Comparing the iBAQ values with the LFQ values
Investigating to which extent the identified proteases are potentially active in the media
Doing some clustering
Running the data in another software (OpenMS) for comparison
On an experimental level, it would have been great to have some biological replicates as control.
Why would histones be used for normalization of leakage?
Possibly because their abundance inside the cell should stay the same over time and they could therefore be used as a reference.
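As a sketch of how that normalization could look (we never implemented it; the protein IDs and numbers are invented):

```python
def histone_normalize(ibaq_by_protein, histone_ids):
    """Divide each protein's iBAQ by the summed histone iBAQ at the same
    time point, using histones as an internal leakage reference."""
    reference = sum(ibaq_by_protein[h] for h in histone_ids
                    if h in ibaq_by_protein)
    return {p: v / reference for p, v in ibaq_by_protein.items()}

# Invented sample: two histones plus one protein of interest.
sample = {"H2A": 40.0, "H4": 60.0, "ProtX": 10.0}
normalized = histone_normalize(sample, ["H2A", "H4"])
print(normalized["ProtX"])  # -> 0.1
```

If the amount of leaked histones tracks overall cell lysis, dividing by it should cancel out time-point-to-time-point differences in leakage.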
Why should the MS1 information be used?
To not lose data; to have more data to work with.
Why compare iBAQ and LFQ values?
To see if there is a difference.
CONCLUSIONS & FUTURE
What were the conclusions?
We concluded that the majority of the proteins in the supernatants were intracellular and that their concentration increases over time. The most common ones have structural functions or proteolytic activity. Proteases showed a linear increase over time and could cause antibody product degradation. Further investigations are needed to see if these are active in the supernatant and what proteins they would degrade. All of this should be taken into account when optimizing the CHO cell cultivation and antibody production.
What could be done in the future?
The first step would be to test the improvements mentioned above, to analyze the output data more deeply and learn more about the proteome of the antibody-producing CHO cells. Hopefully, when more extensive results have been obtained and analyzed, the cultivation can be optimized and spill production reduced, making the antibody production more efficient and of higher quality.
Ps. the author had no conflict of interest to declare 🙂
I realized I haven’t fully commented on our results and written anything about our discussion and conclusions, so I thought I’d sum things up here before the last entry.
The linear regression (R2) study output all the proteins that varied linearly over time. This was done with homewritten Python software, coded by Magnus. For R2 ≥ 0.8, 33 proteins were listed, with plots showing a trend of increasing protein abundance, just like one would expect for the produced antibodies. The abundance was measured in terms of intensity, since there is a correlation between the output intensity from the MS/MS spectra and the concentration of the proteins. None of the plotted linear fits showed a decline in protein abundance when the threshold for plotting intensity variation over time was set to R2 ≥ 0.8. The pictures below show two examples: proteasome subunit alpha type-1 and UTP-glucose-1-phosphate uridylyltransferase.
The intensity values chosen were the iBAQ values (calculating an average of the replicates for each sample), not the LFQ values or “simple” intensity values. (If I got this right, the iBAQ and LFQ values are both from the MS2, whereas the simple intensity values are from the MS1 run and therefore from the raw data.) Perhaps it would have been better to choose the simple intensity values instead? At least this was some of the feedback we got at the poster session. Another question that arose was whether choosing R2 ≥ 0.8 was that smart.
A statistical study in terms of a t-test was carried out as well, to see if there was any significant increase in the protein levels in the supernatant. Though crude, this was done in the following way:
samples from the 17 time points were divided into two halves – the first part of the experiment (time points 1-8) and the second half of the experiment (time points 10-17), excluding all data from the 9th time point. Only proteins with 5 valid values per experimental half were included. The t-test was carried out by homemade Python software coded by Magnus that output box plots for significant changes.
Below are two examples that show a significant increase of protein between the first and the second half of the study. Note that “first 8 days” and “last 8 days” actually refer to the timepoints, not D1-D8 and D38-D45. The following trend was observed: decorin, collagen and other proteins related to cellular structure increased significantly, as did proteolytic proteins.
However, when studying only linear regression, we will miss all the proteins that don’t increase or decrease over time in a linear way. Some may for example start out with a low abundance, increase and peak, only to then decrease again. In order to compensate somewhat for the fact that a lot of interesting proteins maybe wouldn’t show up in the linear regression study, we decided to sort all the proteins based on intensity (again, iBAQ values).
High intensity (iBAQ) study: We did this by calculating the average intensity for each sample/time point and protein, excluding any zero values (since these are not necessarily truly zero), and then sorting the proteins based on intensity, as previously described.
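The zero-excluding average can be sketched in a couple of lines (illustrative only, with made-up replicate values):

```python
def mean_excluding_zeros(replicates):
    """Average of technical replicates, treating 0 as 'not quantified'
    rather than as a true zero abundance."""
    valid = [x for x in replicates if x != 0]
    return sum(valid) / len(valid) if valid else 0.0

# One protein at one time point, three technical replicates:
print(mean_excluding_zeros([8.0, 0, 10.0]))  # -> 9.0
```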
After this we analyzed the gene ontology, as seen below and stated earlier. Out of the top 100 highly abundant proteins, we chose to rank the 10 proteins in that list with the highest intensity. This gave us the table below, where it can be seen that those proteins were predominantly involved in nucleosome assembly or protein folding and that three had proteolytic activity.
What is interesting is that there are so many intracellular proteins with high intensity, meaning that they were highly abundant in the supernatant samples. Unfortunately, we didn’t study the intensity variation over time for these 100, or 10, proteins specifically, so we don’t know in which way their abundance varies. What is expected, however, is that more and more cells will die over time, collapsing and releasing their content. This would be one explanation of the high abundance of intracellular proteins in the supernatant.
As previously mentioned, we studied the topology of the proteins in our top 100 list and the linear regression study (33 proteins). This was done based on UniProt results and Phobius predictions, even though the pie chart below does not show any discrepancy in the results between these two tools. Apparently most of the proteins were non-cytoplasmic without a signal peptide. This means that only a few of the proteins listed in our studies would be expected to actually be secreted and end up in the supernatant together with the produced antibody.
Out of curiosity, we decided to check for overlaps between the two studies and found that six out of the 100 proteins in the high intensity (iBAQ) study were also in the linear regression study list of R2 ≥ 0.8.
Four out of six turned out to be proteasome subunits, and we know from the t-test above that the increase in their abundance was significant between the two halves of the experiment. Now, the questions are:
why are there so many proteasome subunits detected?
are they in any way related to the antibody production?
are they assembled and active?
are they active even in the supernatant and if so, what proteins are they targeting and tagging with ubiquitin for subsequent degradation?
If the proteasomes are active they could very well harm the antibody product. At the same time, proteasomes are definitely also a natural component in that they help recycling proteins and peptides that are misfolded, have been damaged or are not needed.
Finally, we did a study of the intracellular leakage (again, Magnus used his Python programming skills). We did this by looking at how secreted proteins varied in abundance over time. Here, secreted proteins were taken to be the proteins that had been predicted to have a signal peptide. The iBAQ intensities of the predicted secreted proteins for each time point were divided by the total sum of iBAQ values. As can be seen in the graph above, the level is around 30% for the first two time points, only to drop to around 20% (and down to almost 15%) and stay there. This means that the amount of secreted proteins decreases initially but then stabilizes and that, in any case, the majority of proteins found in the supernatant are not secreted proteins and therefore must be the result of an extensive leakage of intracellular proteins into the cultivation media.
Read a summary of the project here and find out how the poster session went here, along with some concluding remarks.
Here it is! The much longed for MaxQuant entry that I planned for some time now – and even promised.
MaxQuant is a software package for quantitative proteomics, specifically aimed at high-resolution MS data. It is designed to analyze large-scale MS data sets and to support all main labeling techniques, like SILAC, Di-methyl, TMT and iTRAQ, as well as label-free quantification. The software package comes with the Andromeda peptide search engine and the Perseus framework, for statistical analysis, integrated.
How does it work?
The software is based on a set of algorithms, including peak detection and scoring peptides. It performs calibration of mass and searches peptide databases to identify proteins, quantifies identified proteins and provides summarizing statistics. See here for user’s manual.
1) Raw data: correction for systematic inaccuracies of measured peptide masses and corresponding retention times of extracted peptides. The raw data can be inspected with the viewer application.
Viewer app: The Viewer module can either be used to a) get some prior information out of generated raw files or to b) find some follow up things after raw files have already been processed (tutorial here).
2) Peptide identification: the masses and intensities of the peptide peaks in an MS spectrum are detected and assembled into 3D peak hills over the m/z-retention time plane. These are filtered to identify isotope patterns by applying graph theory algorithms.
High mass accuracy is achieved by weighted averaging and through mass recalibration: the measured mass (of each MS isotope pattern) minus the determined systematic mass error.
Searching of peptide and fragment masses: organism-specific sequence database search for peptide masses and fragment masses (in the case of MS/MS). MaxQuant has the search engine Andromeda integrated.
Andromeda is a peptide search engine. It is able to assign and score complex patterns of PTMs, such as highly phosphorylated peptides, and accommodates extremely large databases. Identification of co-fragmented peptides improves the number of identified peptides. More info here.
Scoring: peptide and fragment masses are scored by a probability-based approach termed peptide score.
Target-decoy approach and FDR: a target-decoy-based FDR (false discovery rate) approach is used to limit the number of peptide-spectrum matches accepted by chance.
The FDR is determined using statistical methods that account for multiple hypothesis testing.
The organism specific database search includes the reverse counterparts of the target sequences (together with the “forward”/normal sequences) and contaminants to help determine a statistical cutoff for acceptable spectral matches.
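A toy sketch of the target-decoy principle (this is the general idea, not MaxQuant’s exact procedure; scores are invented):

```python
def estimated_fdr(scores, is_decoy, threshold):
    """Among matches scoring at or above the threshold, the number of
    decoy hits estimates how many of the target hits are false."""
    targets = sum(1 for s, d in zip(scores, is_decoy)
                  if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy)
                 if s >= threshold and d)
    return decoys / targets if targets else 1.0

# Invented peptide-spectrum match scores; decoys come from reversed sequences.
scores = [90, 80, 75, 60, 55, 40]
is_decoy = [False, False, False, True, False, True]
print(estimated_fdr(scores, is_decoy, 50))  # -> 0.25
```

Raising the score threshold until the estimated FDR falls below the desired level (e.g. 0.001, as in our MaxQuant run) gives the statistical cutoff.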
3) Assembly of peptide hits into protein hits – to identify proteins: each identified peptide of a protein contributes to the overall identification accuracy.
Matching between runs: an FDR-controlled algorithm that enables MS/MS-free identification of MS features in the complete data set for each single measurement, which increases the number of quantified proteins per sample.
Perseus performs bioinformatic analyses of the output of MaxQuant and so completes the proteomics analysis pipeline (tutorials here).
What is the input?
.raw format data from the MS run. In our case the LC-MS/MS analysis was carried out on a QE-HF (FT-MS or Orbitrap) generating the raw data.
What does it output?
MQ outputs several files of information, among them a .txt file called proteinGroups, with proteins that share the same peptides grouped together. This file can easily be read in Excel.
What are the different columns we get?
Protein groups – groups of proteins that share the same identified peptides. All proteins in a group have the same, or a smaller, number of identified peptides.
Unique peptide – unique sequence obtained by removing the redundancy from the peptide hits.
Peptide hits / spectra hits – the number of peptide-spectrum matches. This roughly describes the relative abundance of a protein, though larger proteins tend to produce more hits.
Razor peptides – a peptide that has been assigned to the protein group with the largest total number of identified peptides.
If unique, the razor peptide only matches to this single protein group.
If not unique, the razor peptide will only be a razor peptide for the group with the largest number of peptide IDs.
If all peptide IDs from a sample analysis can be explained with the presence of a proteinGroup (A), the peptide should be assigned to this group as a razor peptide and you need not assume that there is a second proteinGroup (B) too. NB MaxQuant will still assign the peptide to the second proteinGroup (B) for your information, but not as a razor peptide. This group (B) will however only appear in the output proteinGroup file if it’s also identified by at least one unique peptide, since MQ will always generate the shortest proteinGroup list that is sufficient to explain all peptide IDs. Every peptide sequence is a razor peptide for one proteinGroup only. Read more here.
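The razor rule itself boils down to “give the shared peptide to the biggest group”; a toy sketch (group names and counts invented):

```python
def razor_group(candidate_groups, peptide_ids_per_group):
    """A shared peptide is a razor peptide only for the candidate protein
    group with the largest total number of identified peptides."""
    return max(candidate_groups, key=lambda g: peptide_ids_per_group[g])

# A peptide matching two groups is razor only for the larger one:
counts = {"groupA": 12, "groupB": 3}
print(razor_group(["groupA", "groupB"], counts))  # -> groupA
```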
[Other columns… coming soon]
What are LFQ and iBAQ?
Intensities are the sums of all individual peptide intensities belonging to a particular protein group. Unique and razor peptide intensities are used as default.
LFQ (Label-free quantification) intensities are based on the (raw) intensities and normalized on multiple levels to make sure that profiles of LFQ intensities across samples accurately reflect the relative amounts of the proteins.
iBAQ (Intensity Based Absolute Quantification) values calculated by MaxQuant are the (raw) intensities divided by the number of theoretical peptides. Thus, iBAQ values are proportional to the molar quantities of the proteins. The iBAQ algorithm can roughly estimate the relative abundance of the proteins within each sample.
What is the rationale for choosing iBAQ over LFQ?
How do you interpret the data?
What was the rationale for choosing MaxQuant over OpenMS?
In the end we chose to work with MaxQuant rather than OpenMS (TOPPAS). The reason for this was mainly that we were running out of time and needed to get some results so that we could proceed with analyzing the processed data. OpenMS was just too time-consuming for us (being more or less inexperienced with this type of analysis) due to the many steps involved in building a proper workflow to process our data. We needed to convert files, install different software, find/choose/build pipelines and troubleshoot when things didn’t work. So even though we were very excited about this approach in the beginning, we just had to let go in the end and choose MaxQuant, which was easier to use. What we really liked about OpenMS was how it felt like we were actually building something of our own, which required more research and detective work along with problem-solving skills. For example, Magnus and I both really appreciated the different pipelines we tried and how you could combine them in different ways in TOPPAS. Oh well, maybe in the future, who knows…
What are the pros and cons with MaxQuant?
Peptide identification rates
Peptide mass accuracy
Proteome-wide protein quantification
[more on advantages and disadvantages coming soon]
Hm, that’s all for now. Let’s see if and when I might have time and energy to complete this before our entry deadline at noon on Tuesday next week… After this no more entries are allowed.
Today we submitted our poster for the project. Exciting!! Let’s see how it withstands the scrutiny of our teachers and fellow students on the poster session on Friday… And: what interesting feedback we can get. I think Eva and I are especially excited since this is actually our first time and therefore very special for us… 🙂
We made the poster in Adobe Illustrator, using one of Magnus’ previous posters as a template. In the end, all three of us took turns editing on Magnus’ computer and adjusting our pictures and text. Real teamwork! We decided to keep the text really short in order to fit more pictures and tables in a decent size – and still have some space. I did two pie charts to show the protein topology. These are based on UniProt and Phobius predictions, but unfortunately they do not currently show the discrepancy between the Phobius and UniProt predictions. To be honest, this is mainly due to lack of time, since it was easier this way.
Basically, these are pie charts of the same statistics I published here. Most of the proteins with signal peptides were non-cytoplasmic, but a few contained transmembrane regions, meaning that they had both cytoplasmic and non-cytoplasmic regions.
I would have liked to have a scatterplot of all of the top 100 proteins – or at least top 10 – based on iBAQ values and how these varied over time, but we didn’t have time to do this.
Eva summarized the gene ontology for the top 100 high intensity proteins and made a nice pie chart (see below). She also put together a list of the 10 highest ranking proteins from the top 100 list to give some more information of the most abundant proteins. Apparently, half of them are involved in protein folding or nucleosome assembly. You can read more about what Eva did here.
Now, the next step will be to go through everything to make sure we can explain the project properly step by step, prepare our “elevator pitches” and try to prepare for tricky questions. Obviously, we need to know a lot by heart since the text will not be of much help…
I have to admit I am a bit nervous, but I guess we should focus on having fun and just learning.
Today we have been looking at our data and generated charts, trying to find the best and most interesting/relevant way to picture and describe our results. I have written the abstract for the poster and the conclusions section is on the way. Since Magnus has a nice poster template and Adobe Illustrator, he will be the one to put everything together after we have the texts and the figures.
I’m excited to make our poster and attend the session on Friday, but as always I wish we had more time to look at the data and try some alternative approaches, especially continuing with OpenMS and TOPPAS. Oh well…
I will update with some pictures and conclusions after we have submitted the poster tomorrow at noon.
Last week we finally got our results in MaxQuant and were very excited. Since then we have been thinking of how to examine and analyze them, and how to make sense of what we have got.
1) intensity based approach:
After determining the iBAQ average and sorting out the 100 proteinIDs with the highest intensities, we ran them in UniProt to find out more about what we had. I ran all the data in Phobius as well, to see what it would predict for signal peptides (SP) and transmembrane (TM) regions. To count the entries with specific information, like “proteome” or “signal peptide”, I used the Excel function COUNTIF to see how many of all the entries belonged to the category I was interested in.
For example, to count all entries that were predicted to contain signal peptides according to Phobius, I would type:
Some statistics for the iBAQ-based results:
Proteins with transmembrane regions: 2
Proteins with signal peptides: 19
SPs predicted by Phobius: 21
2) R2-based approach
We also sorted out the proteins that had an R2 ≥ 0.8 and got 27 rows, all in all 33 proteinIDs, which we ran in UniProt and Phobius. Magnus had written Python code to a) check for linear trends over time and b) look at differences between the first half of the experiment (time points 1–8) and the second half of the experiment (time points 10–17). We used the results from this to sort out the proteins with the largest increase/decrease in intensity.
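The two checks can be sketched roughly like this (this is not Magnus’ actual code, just a minimal illustration on made-up, perfectly linear toy data):

```python
# (a) fit a line to one protein's intensity series and report slope and R2,
# (b) compare the mean intensity of the first half (time points 1-8) with
#     the second half (time points 10-17). Data are made up.
from statistics import mean

def linear_fit(xs, ys):
    """Ordinary least squares: returns (slope, intercept, r_squared)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

time_points = list(range(1, 18))
intensity = [2.0 * t + 5.0 for t in time_points]  # toy data: exactly linear

slope, intercept, r2 = linear_fit(time_points, intensity)
first_half = mean(intensity[0:8])    # time points 1-8
second_half = mean(intensity[9:17])  # time points 10-17
print(slope, r2, second_half - first_half)
```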
Some statistics for the R2-based approach:
Proteins with transmembrane regions: 5 – 4 with 1 TM region, 1 with 2 TM regions
Proteins with signal peptides: 3
SPs predicted by Phobius: 4
3) Ratio of secreted proteins over time
Magnus has been writing Python code for studying how the ratio of secreted proteins to all proteins varies over time. Over time, cells will die and leak intracellular proteins, and with the help of Magnus’ program we can study this trend. Check out what Magnus has been working on here, on his blog (where you can also read the code for the linear regression program).
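The basic idea of the ratio calculation can be sketched like this (again not Magnus’ actual code; the “secreted” flags and intensities below are made up, and in practice the flag would come from the signal-peptide annotation):

```python
# Fraction of total intensity contributed by secreted proteins, per
# time point. Three hypothetical proteins over three time points.
proteins = {
    "P1": {"secreted": True,  "intensity": [100.0, 150.0, 300.0]},
    "P2": {"secreted": False, "intensity": [400.0, 350.0, 200.0]},
    "P3": {"secreted": True,  "intensity": [100.0, 100.0, 100.0]},
}

def secreted_ratio(proteins, t):
    """Secreted intensity / total intensity at time point index t."""
    total = sum(p["intensity"][t] for p in proteins.values())
    secreted = sum(p["intensity"][t] for p in proteins.values()
                   if p["secreted"])
    return secreted / total

ratios = [secreted_ratio(proteins, t) for t in range(3)]
print(ratios)  # rising ratio would suggest relatively less leakage
```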
A couple of days ago, we finally got our results for the whole raw data set using the MaxQuant software. All in all, we have around 2600 identified and quantified protein groups in a MaxQuant “proteinGroups” output file.
So, what have we done so far with our MaxQuant results?
1) Magnus ran the reference proteome of Chinese hamster through Phobius to predict transmembrane (TM) topology and signal peptides (SP) from the amino acid sequences of the proteins (in .fasta format from UniProt), using the short output format.
“If the whole sequence is labeled as cytoplasmic or non cytoplasmic, the prediction is that it contains no membrane helices. /…/ The prediction gives the most probable location and orientation of transmembrane helices in the sequence.”
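Tallying such a run can be sketched like this. Note that the column layout assumed here (ID, TM count, SP flag, prediction string) should be checked against the header of the actual Phobius output, and the three rows below are invented examples:

```python
# Parse Phobius short-format output and collect proteins with predicted
# TM regions or signal peptides. Assumed columns: ID, TM, SP, PREDICTION.
sample_output = """\
SEQENCE_ID TM SP PREDICTION
PROT1 0 Y n8-18c23/24o
PROT2 2 0 i12-34o40-62i
PROT3 0 0 o
"""

tm_proteins, sp_proteins = [], []
for line in sample_output.splitlines()[1:]:  # skip the header line
    name, tm, sp, prediction = line.split()
    if int(tm) > 0:          # at least one predicted TM helix
        tm_proteins.append(name)
    if sp == "Y":            # predicted signal peptide
        sp_proteins.append(name)

print(tm_proteins, sp_proteins)
```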
2) Magnus has used his Python programming skills for linear regression of the proteins. The idea is that we will look at the proteinIDs that increase and decrease the most over time, that is, the ones with the steepest slopes.
3) Eva and I sorted the proteins based on R2 ≥ 0.8 and found that (luckily?!) only 27 of the proteinIDs had an R2 value at or above this threshold.
We used UniProt to find out what proteins we had and more info on them. Actually, we started out doing this one protein ID at a time in UniProt, collecting mainly information about the protein name and whether there was a signal peptide there or not. We started using Phobius to check for TM regions and SPs, before remembering that Magnus had already run Phobius the other day.
4) We also decided on an approach where we choose the 100 proteins with the highest total intensity (iBAQ). I don’t remember whose idea this was originally, but Eva did the necessary calculations in Excel to fish out the most interesting proteins. She determined the average total intensity for each protein, sorted the list in descending order, and chose the 100 proteins with the largest intensity. Read more about this in Eva’s blog.
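Eva did this in Excel, but the selection amounts to something like the following sketch (made-up iBAQ values, and top 2 instead of top 100):

```python
# Average the iBAQ values per protein across all time points and keep
# the n proteins with the highest mean. Data are hypothetical.
from statistics import mean

ibaq = {
    "P1": [5.0, 6.0, 7.0],
    "P2": [1.0, 1.0, 1.0],
    "P3": [9.0, 8.0, 10.0],
}

def top_by_mean_intensity(ibaq, n):
    """Return the n protein IDs with the highest mean iBAQ."""
    averages = {pid: mean(vals) for pid, vals in ibaq.items()}
    return sorted(averages, key=averages.get, reverse=True)[:n]

print(top_by_mean_intensity(ibaq, 2))  # ['P3', 'P1']
```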
We chose to take on this approach together with the linear regression approach since we thought these could be two interesting ways to find important proteins that vary over time. The problem with the R2-approach is that it only keeps the proteins that are best fit by a line and fishes out the ones with the steepest slopes, thus completely neglecting any protein variation that cannot be described by this linear model. The idea with the intensity approach is to look at all high-intensity proteins (or rather, the top 100) so as to not miss any that do not vary greatly over time.
Eva figured out how to use UniProt to analyze all our proteins and output the information that we want, for example whether they contain signal peptides or transmembrane regions, and their GO (gene ontology) information. Very convenient. So we did this first for the 100 proteins with the largest average intensity (iBAQ) and then also for the proteins selected based on R2 value.
Protein ID input
The columns we chose were:
Protein names, EC number, Function [CC], Gene ontology (GO), Gene ontology (biological process), Gene ontology (molecular function), Gene ontology (cellular component), Protein families, Signal peptide, Intramembrane, Topological domain, Transmembrane, Subcellular location [CC]
5) We decided that I will be in charge of double checking this information against Phobius, just to make sure that there is no difference in the outcome – and to make notes in case there would be any difference.
6) Magnus also suggested we should calculate the ratios of secreted proteins/cellular proteins at the beginning and at the end of the experiment and compare them to give us an idea of how the leakage products increase or decrease with time in the bioreactor. Let’s see if we have time for this…
Actually, I have lost track of how many meetings we have had in the group by now. Usually, we meet quite casually to just work side by side, sometimes all of us, sometimes only two of us, and of course some work has been carried out individually. For example, Eva and Magnus met yesterday, but I could not attend due to previous engagements, so I will work this weekend instead.
So, we decided to run MaxQuant instead of OpenMS (TOPPAS). The good news is that we finally got all of the data some days ago and have managed to run all of it in MaxQuant and got results. Now we are looking at how to analyze the data. While Eva and Magnus have been working more on running the data and handling the MaxQuant output, I have been reading up on MaxQuant and how it works. I will post an update about this later today, or tomorrow.
We have also been discussing how to analyze the data further to draw biological conclusions, and we got some more suggestions from Lukas during the last seminar on Wednesday. One suggestion is to use linear regression to see which proteins vary the most over time and use this as a way to choose the proteins to look closer at. Another suggestion is to look at which genes are mainly expressed.
I was extremely tired at the seminar so I must admit that I had problems following Eva’s and Lukas’ reasoning. At least it seemed Lukas was happy with what we had so far and did not seem worried about our time plan or the project outcome. Such a relief!
During the seminar Lukas and Olof, one of the other teachers of this course, mainly informed us about what will happen during the poster presentation event next Friday and gave us advice on how to make a good poster. Admittedly, I have never made a poster before, but I feel it is pretty straightforward. Then let’s see how well we manage… In any case, I am really looking forward to this!
Today we decided to drop OpenMS and focus on MaxQuant. OpenMS was fun and seemed promising, but in the end it was too time-consuming given that we hadn’t made faster progress with everything (installation, choosing pipelines, getting the correct file formats, successfully running the workflows and reading the output). I really think it’s a shame, because I thought it was really fun and wanted to learn and make it work. Then again, the presentation is next week and we need results in the coming days, so we had to prioritize. I lost all of last week since I was sick and had to work after that, so I am definitely behind. Now my focus is on reading up on MaxQuant, helping Eva and Magnus understand the software and results, and interpreting the output files. I started this today and will make a new update tomorrow, when I have more time and know more. For reasons already mentioned, I won’t run MaxQuant on my computer, so Eva and Magnus are in charge of that. I will help with background research and interpreting the data instead. At least that is the plan right now.
This is what the MaxQuant output looks like:
This is just from a test run though, with only two time-points (not sure how many replicates Magnus ran): day 3 and day 5. I am now in charge of making sense of this file and the information it contains… Let’s see how that goes!
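As a first step toward making sense of such a table, here is a minimal sketch for filtering it. The “proteinGroups” file is tab-separated; the column names below follow common MaxQuant conventions (reverse hits and potential contaminants marked with “+”) but should be checked against the real file header, and the three rows are made up:

```python
# Filter a MaxQuant-style proteinGroups table: drop reverse (decoy)
# hits and potential contaminants before any downstream analysis.
import csv
import io

# Tiny made-up stand-in for proteinGroups.txt (tab-separated).
sample = (
    "Protein IDs\tReverse\tPotential contaminant\tiBAQ\n"
    "P1\t\t\t1000\n"
    "REV__P2\t+\t\t500\n"
    "CON__P3\t\t+\t800\n"
)

rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
kept = [r for r in rows
        if r["Reverse"] != "+" and r["Potential contaminant"] != "+"]
print([r["Protein IDs"] for r in kept])  # ['P1']
```

For the real file, replace `io.StringIO(sample)` with `open("proteinGroups.txt")`.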
I will update with more information on MaxQuant and the output, as soon as possible.