Abstract Enormous amounts of biomedical data have been and are being produced at an unprecedented rate by researchers all over the world. However, in order to enable reuse, there is an urgent need to understand the structure of datasets, the experimental conditions under which they were produced and the information that other investigators may need to make sense of the data.
In other words, data need an accurate, structured, and complete description, known as metadata. Good-quality metadata is essential for finding, interpreting, and reusing existing data beyond what the original investigators envisioned. Progenetix, a public resource of copy number variants, already holds various kinds of metadata, including sex, age, tumor stage, grade, and TNM classification. However, this clinical information is available for far too few of the biosamples in Progenetix and urgently needs to be supplemented. In this talk, I will mainly present how clinical information is stored in the GEO database, how we extract it, and how we add it to Progenetix. With these additions, Progenetix gains a better hierarchical stage/TNM/grade cancer classification. I will also introduce a reference copy number variation (CNV) database built from data collected from the 1000 Genomes Project. As a case study, I will share some CNV pattern analyses of two tumor types, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).