Data Science

Making tables in Rmarkdown: {DT} and {kableExtra}

{DT} is a package for rendering HTML tables. It is an interface to the DataTables JavaScript library. It should not be confused with {data.table}, which is a package for data wrangling. {kableExtra} is a similar package with the same purpose. I found that {kableExtra} is more suitable for making static tables, whereas {DT} is more suitable for making interactive tables.
DT: Adding captions
    DT::datatable(
      iris[1:10, ],
      caption = htmltools::tags$caption(
        style = 'caption-side: top; text-align: center; color:black; font-size:200% ;',
        'Table1: Iris Dataset Table'
      )
    )

Rmarkdown and markdown notes

This contains notes for Rmarkdown and markdown. All notes for markdown are generally applicable to Rmarkdown. Sections: markdown (Footnote); Rmarkdown (Markdown extras).
Adding toggle:
    <details><summary>toggle title</summary> toggle content </details>
Image quality:
    knitr::opts_chunk$set(dpi = 300)

String processing in R

Case: snakecase::to_any_case() helps to fix the case of words, which is useful for converting between presentation data and processing data. Regex: The regex in R is not 100% Perl flavored. For example, the escape character is \\ instead of \. I love the {rev} package. The most salient problem that this package solves is the interpretation of regex. When not to use regex: Not all strings are interpreted as a regex. Sometimes, one needs to opt in via a parameter in the function.
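To make the escaping point concrete, here is a minimal base R sketch. The excerpt does not say which functions it has in mind, so `grepl()` and its `fixed` argument are used here as an assumed illustration of a parameter that controls whether a pattern is read as a regex.

```r
x <- c("a.c", "abc")

# "." is a regex metacharacter, so it matches any single character.
match_regex <- grepl("a.c", x)

# Escape the dot; the backslash is doubled inside an R string literal.
match_escaped <- grepl("a\\.c", x)

# Treat the pattern as a literal string instead of a regex.
match_fixed <- grepl("a.c", x, fixed = TRUE)

print(match_regex)
print(match_escaped)
print(match_fixed)
```

The first call matches both elements, while the escaped and fixed versions match only the string that contains a literal dot.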

The bright side of plots (R plot notes)

Contents: Font (font size; font format, e.g. subscripts); Combine plots; Helpful packages: 3D rendering. It will be useful to consult the Rmarkdown notes because I often use Rmarkdown to render ggplot. Font: Font size: Global adjustment (e.g. the default font size is small when rendering with Rmarkdown with ): the theme's base size. See individual adjustment here and also in this RStudio2021 conference talk. Font format (e.

Working with NA in R

NA values are necessary markers for missing data. However, working with them can be tricky because of their special properties. Care should also be taken when reading in and presenting the data. Properties of NA: Types: There are different types of NA, denoted by NA_* (e.g. NA_character_). This should be noted when working with NA data in a data.frame. Operations like case_when require all output data to be of the same type.
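The typed NA constants can be inspected directly in base R. This is a small sketch rather than part of the original post. The variable names are only for illustration.

```r
# Each NA constant carries its own type.
na_types <- c(
  plain     = typeof(NA),
  integer   = typeof(NA_integer_),
  real      = typeof(NA_real_),
  character = typeof(NA_character_)
)
print(na_types)

# Comparing with NA yields NA, so use is.na() to test for missingness.
comparison <- NA == NA
missing_flags <- is.na(c(1, NA, 3))
print(comparison)
print(missing_flags)
```

When a function expects every output to share one type, pairing character results with NA_character_ (rather than the plain logical NA) keeps the types aligned.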

Learning functional programming in R

Why use functional programming. Avoid intermediate objects: In any loop, the standard practice is to create a new list before the loop, do some processing for each element of the list in the loop and then add the processing result as an element to the new list (following the same index). It makes programming more fun: Thinking about that index i is simply not as fun as working with the whole list.
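The contrast described above can be sketched in base R. The data and the squaring step are placeholders chosen for illustration, and `lapply()` stands in for whichever mapping function the full post uses.

```r
values <- list(1, 2, 3)

# Loop version: create a result list first, then fill it by index.
squared_loop <- vector("list", length(values))
for (i in seq_along(values)) {
  squared_loop[[i]] <- values[[i]]^2
}

# Functional version: no pre-allocated list and no index to manage.
squared_map <- lapply(values, function(x) x^2)

print(identical(squared_loop, squared_map))
```

Both versions produce the same list. The functional one simply leaves the bookkeeping to `lapply()`.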

Research tools

Semantic Scholar: an NLP-powered “PubMed” that generates quick summaries for articles. Scite: for each article, it indicates the nature of the citation, i.e. approving, neutral, or disproving. Meta: a research feed generator. Others: Connected Papers and CORE. Both are available on arXiv and show relationships between papers. Connected Papers shows a map.

Caveats when working with bioinformatics data

This documents the common pitfalls when working with bioinformatics data and how to prevent them. Headers: Case: use janitor::clean_names to standardize names to snake case. Names: use a standardized name: chr for chromosome, instead of chrom, seqnames, etc. Sometimes you have to change the name to fit a certain software (e.g. GenomicRanges), but only convert the name within the call of the function itself, and immediately change back. Never propagate the name change to the next function because it will then be a headache to deal with the dependencies between functions.
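One way to keep a name change local to a single call is sketched below in base R. `count_by_seqnames()` is a hypothetical stand-in for a tool that insists on a `seqnames` column, and the example data are made up.

```r
regions <- data.frame(
  chr   = c("chr1", "chr2"),
  start = c(100, 200),
  end   = c(150, 250)
)

# Hypothetical downstream tool that requires a `seqnames` column.
count_by_seqnames <- function(df) {
  stopifnot("seqnames" %in% names(df))
  table(df$seqnames)
}

# Rename only inside the call; `regions` itself keeps the standard `chr` name.
counts <- count_by_seqnames(
  setNames(regions, sub("^chr$", "seqnames", names(regions)))
)

print(names(regions))
```

Because the renamed copy exists only as an argument, nothing downstream ever sees `seqnames`.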

How to split a string column by length

Intro: This documents how I split a string column by its length and combined the pieces in a directory format (which was a necessary step for me to check whether each directory existed in my analysis).
    library(tidyverse)
    data <- tibble(string = c("123456", "987654"))
    print(data)
    ## # A tibble: 2 x 1
    ##   string
    ##   <chr>
    ## 1 123456
    ## 2 987654
Step 1: strsplit splits the string into a list of strings, and in a tibble it will show up as a column of list type.
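The excerpt stops before the remaining steps, so the following is only a rough base R sketch of the overall idea, not the post's method. It assumes chunks of two characters joined with "/", and `split_by_length()` is a hypothetical helper that expects non-empty strings.

```r
# Hypothetical helper: cut one non-empty string into pieces of `width` characters.
split_by_length <- function(x, width = 2) {
  starts <- seq(1, nchar(x), by = width)
  substring(x, starts, pmin(starts + width - 1, nchar(x)))
}

strings <- c("123456", "987654")

# One character vector of chunks per input string.
chunks <- lapply(strings, split_by_length, width = 2)

# Join each set of chunks into a directory-style path.
paths <- vapply(chunks, paste, character(1), collapse = "/")
print(paths)
```

The resulting paths could then be passed to `dir.exists()` to check whether each directory is present.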

How I would learn programming in 7 days

This will be part of a series of articles on learning programming and data science. There are many articles on this topic already, but these are for my friends. This post focuses on learning programming. Most data scientists use Python and R. Between the two, I think Python is the more programming-oriented language. The types of objects are more straightforward, the syntax is easier, and the object-oriented approach is clearer.