Load dataset

To load a dataset correctly into bioScience, the gene expression values must be numerical, with the genes in the rows and their experimental conditions in the columns. Given this layout, bioScience is compatible with datasets generated in a variety of ways:

  • Compatibility with artificially generated (synthetic) datasets and datasets generated by any sequencing technology such as microarrays or RNA-Seq.

  • Additional information, such as gene names or features specific to a sequencing technology (for example, gene length for RNA-Seq), is stored in the library so that it can be used during data processing. For example, if the dataset provides gene length information, the user can apply normalisation methods that depend on this feature, such as the TPM normalisation method.

  • The columns of the dataset do not have to be separated by one fixed character, such as a tab. The load function has a separator parameter that allows datasets to be loaded with a user-defined separator character. To understand the meaning of each attribute in this load function you can access the API reference.

  • The library is compatible with datasets whose expression values have not been pre-processed, have already been pre-processed, or have been transformed to binarise their expression values.
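The expected layout (genes in rows, experimental conditions in columns, numerical values, a user-chosen separator) can be illustrated with a minimal sketch that uses only the Python standard library. This is not the bioScience load function: the helper read_expression_table, its separator argument and the sample data are hypothetical and exist only to show what a compatible file looks like.

```python
import csv
import io


def read_expression_table(text, separator="\t"):
    """Parse delimited text laid out as bioScience expects (hypothetical helper).

    The first row is a header with the condition names, the first column
    holds the gene names, and every other cell is a numerical value.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=separator))
    header, body = rows[0], rows[1:]
    conditions = header[1:]
    genes = [row[0] for row in body]
    # float() raises ValueError if a cell is not numerical.
    values = [[float(cell) for cell in row[1:]] for row in body]
    return genes, conditions, values


# Illustrative data only: two genes, two conditions, ';' as separator.
sample = "gene;cond1;cond2\nG1;1.5;2.0\nG2;0.0;3.25\n"
genes, conditions, values = read_expression_table(sample, separator=";")
print(genes, conditions, values)
```

A file that parses cleanly with a helper like this has the shape described above; the actual keyword used by the bioScience load function for the separator should be taken from the API reference.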

Load gene co-expression dataset (microarray, synthetic and generic)

This option allows the loading of a microarray, synthetic or generic dataset.

import bioscience as bs
dataset = bs.load(path="datasets/synthetic.txt", index_gene=0, naFilter=True, head = 0)

To understand the meaning of each attribute in this load function you can access the API reference.

Load RNA-Seq dataset

The following source code loads an RNA-Seq dataset.

import bioscience as bs
dataset = bs.load(path="datasets/rnaseq.txt", index_gene=0, index_lengths=1, naFilter=True, head=0)

To understand the meaning of each attribute in this load function you can access the API reference.
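Gene lengths matter because length-dependent normalisations such as TPM need them. The sketch below shows the standard TPM calculation in plain Python, to make clear why the length column is loaded. It is not bioScience's implementation; the function name tpm and the sample counts are assumptions made for illustration.

```python
def tpm(counts, lengths_bp):
    """Standard TPM normalisation (illustrative, not the bioScience code).

    counts:     list of rows, one row per gene, one column per sample.
    lengths_bp: gene lengths in base pairs, one per gene.
    """
    # Step 1: reads per kilobase (divide each count by gene length in kb).
    rpk = [
        [count / (length / 1000.0) for count in row]
        for row, length in zip(counts, lengths_bp)
    ]
    n_samples = len(counts[0])
    # Step 2: per-sample sum of the length-normalised values.
    totals = [sum(row[j] for row in rpk) for j in range(n_samples)]
    # Step 3: scale so that each sample sums to one million.
    # A sample with no reads at all is left as zeros.
    return [
        [row[j] / totals[j] * 1e6 if totals[j] else 0.0 for j in range(n_samples)]
        for row in rpk
    ]


# Illustrative data only: two genes, two samples.
counts = [[10, 0], [20, 30]]
lengths = [1000, 2000]
result = tpm(counts, lengths)
print(result)
```

After normalisation each sample (column) sums to one million, which is what makes TPM values comparable across samples.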

Load binary dataset

bioScience also allows loading of binary datasets, because certain data mining algorithms only support this type of data. To do so, the user can either load a previously binarised dataset directly or load their dataset and binarise it internally with the bioScience library.

import bioscience as bs
dataset = bs.load(path="datasets/binary.txt", index_gene=0, naFilter=False, head = 0)

To understand the meaning of each attribute in this load function you can access the API reference.
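To make concrete what a binarised dataset contains, the following sketch applies a simple fixed-threshold binarisation in plain Python. It is only an illustration under stated assumptions: the function binarise and the threshold value are hypothetical, and the binarisation methods that bioScience itself provides are described in the API reference.

```python
def binarise(values, threshold):
    """Map each expression value to 1 if it reaches the threshold, else 0.

    Hypothetical helper for illustration; not a bioScience function.
    """
    return [[1 if value >= threshold else 0 for value in row] for row in values]


# Illustrative data only: two genes, three conditions.
expression = [[0.2, 1.7, 3.0], [2.5, 0.0, 0.9]]
binary = binarise(expression, threshold=1.0)
print(binary)
```

A file holding values like these (only 0 and 1, genes in rows, conditions in columns) is the kind of dataset the direct binary load is meant for.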