Complete course Intro to SQL for Data Science and chapters Importing data from databases Part 1 and 2 DataCamp. Read the RSQLite vignette.
There are essentially three ways to send SQL-queries directly from RStudio to a database mybase
, where we assume mybase
is an SQLite database stored in file mybase.sqlite
.
dbGetQuery
after establishing a conection using dbConnect
as incon <- dbConnect(SQLite(), "mybase.sqlite")
dbGetQuery(con, "SELECT * FROM table")
```{r}
con <- dbConnect(SQLite(), "mybase.sqlite")
```
```{sql, connection = con}
SELECT * FROM table
```
We recommend using the first option within an R script file.
Practise SQL joins at w3schools.
The SQLite database Class_files/sthlm_metro.sqlite
contains data on the stops of the Stockholm metro at the platform level, which was generated for the 2018 class by Class_files/sthlm_metro.R
and originates from Trafiklab.se.
Connect to the database, list the tables and figure out how they relate to eachother.
LineName
from table Line
where LineNumber
is 18.StationName
, where position is measured as the average Latitude
of its PlatformNumber
.LineNumber
18.Query for all StationName
on LineNumber
18 in alphabetical order.
The SQLite database Class_files/pokedex.sqlite
contains the full set of tables from https://github.com/veekun/pokedex/tree/master/pokedex/data/csv generated by the file Class_files/pokedexDB.R
.
Close any existing connections with dbDisconnect
and connect to the database.
height
and average weight
of pokemon (pokemon
table).weight
.ghost
-type pokemon.pokemon_id
, pokemon_name
and six additional columns giving the pokemons base_stat
-value in each of categories hp
, attack
, defense
, special-attack
, special-defense
and speed
(see tables stats
and pokemon_stats
). You may use SQL to ask for the table in long format and convert it to the final wide format using spread
in R.Ensembl hosts a public genomic database with a MySQL server. You can connect to the database holding the human genome using (note the use of MySQL
rather than SQLite
)
library(RMySQL)
## Loading required package: DBI
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
user = "anonymous", password = "",
port = 3306)
we are now connected to a whole set of databases
databases <- dbGetQuery(con, "SHOW DATABASES")
head(databases)
## Database
## 1 information_schema
## 2 PERCONA_SCHEMA
## 3 acanthochromis_polyacanthus_core_94_1
## 4 acanthochromis_polyacanthus_core_95_1
## 5 acanthochromis_polyacanthus_core_96_1
## 6 acanthochromis_polyacanthus_core_97_1
nrow(databases)
## [1] 10519
Yes, plenty of databases. The latest version of the human genome is in homo_sapiens_core_94_38
, we choose this by
dbSendQuery(con, "USE homo_sapiens_core_94_38")
## <MySQLResult:0,0,1>
dbListTables(con)
## [1] "alt_allele"
## [2] "alt_allele_attrib"
## [3] "alt_allele_group"
## [4] "analysis"
## [5] "analysis_description"
## [6] "assembly"
## [7] "assembly_exception"
## [8] "associated_group"
## [9] "associated_xref"
## [10] "attrib_type"
## [11] "biotype"
## [12] "coord_system"
## [13] "data_file"
## [14] "density_feature"
## [15] "density_type"
## [16] "dependent_xref"
## [17] "ditag"
## [18] "ditag_feature"
## [19] "dna"
## [20] "dna_align_feature"
## [21] "dna_align_feature_attrib"
## [22] "exon"
## [23] "exon_transcript"
## [24] "external_db"
## [25] "external_synonym"
## [26] "gene"
## [27] "gene_archive"
## [28] "gene_attrib"
## [29] "genome_statistics"
## [30] "identity_xref"
## [31] "interpro"
## [32] "intron_supporting_evidence"
## [33] "karyotype"
## [34] "map"
## [35] "mapping_session"
## [36] "mapping_set"
## [37] "marker"
## [38] "marker_feature"
## [39] "marker_map_location"
## [40] "marker_synonym"
## [41] "meta"
## [42] "meta_coord"
## [43] "misc_attrib"
## [44] "misc_feature"
## [45] "misc_feature_misc_set"
## [46] "misc_set"
## [47] "object_xref"
## [48] "ontology_xref"
## [49] "operon"
## [50] "operon_transcript"
## [51] "operon_transcript_gene"
## [52] "peptide_archive"
## [53] "prediction_exon"
## [54] "prediction_transcript"
## [55] "protein_align_feature"
## [56] "protein_feature"
## [57] "repeat_consensus"
## [58] "repeat_feature"
## [59] "seq_region"
## [60] "seq_region_attrib"
## [61] "seq_region_mapping"
## [62] "seq_region_synonym"
## [63] "simple_feature"
## [64] "stable_id_event"
## [65] "supporting_feature"
## [66] "transcript"
## [67] "transcript_attrib"
## [68] "transcript_intron_supporting_evidence"
## [69] "transcript_supporting_feature"
## [70] "translation"
## [71] "translation_attrib"
## [72] "unmapped_object"
## [73] "unmapped_reason"
## [74] "xref"
Check the assembly table if you wonder how the tables relate to eachother… We may take a closer look at a table by DESCRIBE
dbGetQuery(con, "DESCRIBE gene")
## Field Type Null Key Default
## 1 gene_id int(10) unsigned NO PRI <NA>
## 2 biotype varchar(40) NO <NA>
## 3 analysis_id smallint(5) unsigned NO MUL <NA>
## 4 seq_region_id int(10) unsigned NO MUL <NA>
## 5 seq_region_start int(10) unsigned NO <NA>
## 6 seq_region_end int(10) unsigned NO <NA>
## 7 seq_region_strand tinyint(2) NO <NA>
## 8 display_xref_id int(10) unsigned YES MUL <NA>
## 9 source varchar(40) NO <NA>
## 10 description text YES <NA>
## 11 is_current tinyint(1) NO 1
## 12 canonical_transcript_id int(10) unsigned NO MUL <NA>
## 13 stable_id varchar(128) YES MUL <NA>
## 14 version smallint(5) unsigned YES <NA>
## 15 created_date datetime YES <NA>
## 16 modified_date datetime YES <NA>
## Extra
## 1 auto_increment
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
gene
-table?biotype
in the gene
-table).dna
table.