Why Do We Need Data Management?

Digital research, especially in biology, creates large amounts of complex data:

  • roughly 3 billion base pairs in the human genome (one copy of each of the 23 chromosomes)
  • 20,000+ human genes
  • 60,000+ human protein variants
  • measuring the expression patterns of all of these requires many files for each step of the central dogma of biology (DNA, RNA, protein)

  • Data complexity
    • huge numbers of files, especially once you count the metadata that relates one piece of data to another
    • large storage capacity needed, either for individual files or collectively

  • This data varies in format and type
    • raw text
    • delimited text
    • binary (not human readable)
    • formats often differ sharply in their storage requirements and limitations
  • Without a management plan:
    • protecting, sharing, and even locating data can be a challenge
  • With a plan:
    • researchers can focus on their areas of expertise
    • management policies can be automated
    • replicable studies, open-access datasets, and cross-collection resources can be created, curated, and maintained

Example:

  • Given about 100 similarly named files in a directory, file0.dat through file100.dat

    • DISCUSS: what can we guess about this data?
      • very little: the names say nothing about content, format, or origin
  • Over these modules, we'll discuss data organization best practices for improved computational efficiency, performance, and security.
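
To make the file0.dat example concrete, here is a minimal Python sketch of what can still be recovered from such a directory without any documentation: each file's size and a rough guess at whether it is raw text, delimited text, or binary. The function names `sniff_format` and `summarize`, the 4096-byte sample size, and the guessing rules are all assumptions made for illustration, not part of the course material, and the heuristic can easily guess wrong on real data.

```python
import codecs
from pathlib import Path


def sniff_format(path, sample_size=4096):
    """Roughly guess 'empty', 'binary', 'delimited text', or 'raw text'.

    Hypothetical heuristic: looks only at the first sample_size bytes.
    """
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if not sample:
        return "empty"
    # A NUL byte almost never appears in text files.
    if b"\x00" in sample:
        return "binary"
    try:
        # The incremental decoder tolerates a multi-byte character
        # that was cut off at the end of the sample.
        decoder = codecs.getincrementaldecoder("utf-8")()
        text = decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return "binary"
    lines = [line for line in text.splitlines() if line.strip()]
    # If the sample filled the buffer, the last line may be incomplete.
    if len(sample) == sample_size and len(lines) > 1:
        lines = lines[:-1]
    # Treat the file as delimited if every line has the same non-zero
    # number of tabs (or of commas).
    for delimiter in ("\t", ","):
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1 and min(counts) > 0:
            return "delimited text"
    return "raw text"


def summarize(directory, pattern="file*.dat"):
    """Map each matching file name to (size in bytes, guessed format)."""
    report = {}
    for path in sorted(Path(directory).glob(pattern)):
        report[path.name] = (path.stat().st_size, sniff_format(path))
    return report
```

Calling `summarize(".")` in the example directory returns one entry per file. Even with this information, nothing explains what the measurements are, how the files relate to one another, or which ones are still needed, which is the gap a management plan is meant to fill.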

