data sequence and customized csv dataset #2733

wangjiawen2013 · 2025-01-22T08:37:30Z


Hi,
We can construct an InMemDataset from a csv file by following burn's example (https://github.com/tracel-ai/burn/blob/main/examples/custom-csv-dataset/src/dataset.rs).

But when the csv is very wide (for example, 1000 columns), it is impractical to manually define a struct with every column as a field. Is there an easier way?

Besides, how can I construct a tensor from a string of numbers? Here is what I have in mind:

let s = String::from("1, 2, 3, 4, 5, 6");
let tensor = Tensor::<B, 1>::from_floats(s);  // I want a tensor [1, 2, 3, 4, 5, 6], but this will not work

Both would be useful when implementing an LSTM, since sequences are needed.

laggui · 2025-01-22T18:31:09Z

Maintainer

> But, when the csv is very wide (such as having 1000 columns), it is impossible to construct a struct with all the columns as fields manually. Is there an easy way?

The InMemDataset::from_csv(...) method is there for convenience. It simply uses the csv crate to parse each record and deserializes it into the provided struct thanks to serde.

burn/crates/burn-dataset/src/dataset/in_memory.rs

Lines 72 to 88 in 245fbcd

    
    pub fn from_csv<P: AsRef<Path>>(
        path: P,
        builder: &csv::ReaderBuilder,
    ) -> Result<Self, std::io::Error> {
        let mut rdr = builder.from_path(path)?;

        let mut items = Vec::new();

        for result in rdr.deserialize() {
            let item: I = result?;
            items.push(item);
        }

        let dataset = Self::new(items);

        Ok(dataset)
    }

If that doesn't fit your needs, you can implement your own parsing to create the InMemDataset from your collection of items.
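
As a minimal sketch of such custom parsing, the snippet below reads each row into a single vector of features instead of one struct field per column. It uses only the standard library. `WideRow` and `parse_wide_csv` are hypothetical names for illustration and are not part of burn. It assumes a header line, purely numeric cells with no quoted fields, and that the last column is the target.

```rust
/// Hypothetical dataset item: all feature columns land in one vector,
/// so no per-column struct field is needed.
#[derive(Debug, Clone, PartialEq)]
pub struct WideRow {
    pub features: Vec<f32>,
    pub target: f32,
}

/// Parses CSV text whose first line is a header and whose last column is the target.
/// Assumption: plain comma-separated numbers, no quoted fields or embedded commas.
pub fn parse_wide_csv(text: &str) -> Result<Vec<WideRow>, String> {
    let mut rows = Vec::new();
    // Skip the header line; `idx + 1` gives 1-based line numbers for error messages.
    for (idx, line) in text.lines().enumerate().skip(1) {
        if line.trim().is_empty() {
            continue;
        }
        let mut values = line
            .split(',')
            .map(|cell| {
                cell.trim()
                    .parse::<f32>()
                    .map_err(|e| format!("line {}: {:?}: {}", idx + 1, cell, e))
            })
            .collect::<Result<Vec<f32>, String>>()?;
        // The last value is treated as the target, the rest as features.
        let target = values
            .pop()
            .ok_or_else(|| format!("line {}: empty record", idx + 1))?;
        rows.push(WideRow { features: values, target });
    }
    Ok(rows)
}
```

The resulting `Vec<WideRow>` is a plain collection of items, so it could be passed to `InMemDataset::new(...)` in the same way `from_csv` does internally.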

> Besides, how to construct a tensor from a digital string ?
> They may be useful when implementing a LSTM as sequences are needed.

You cannot construct a tensor from a string. For NLP tasks, you need to go from the string representation to tokens. This can be done in many different ways, so the implementation is up to the user.
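
If the string only holds comma-separated numbers, as in the question, one option is to parse it into floats first and build the tensor from those values. The helper below is a rough sketch using only the standard library. `parse_floats` is a hypothetical name, and the burn call in the comment is an assumption about the API of recent burn versions, so check it against the version you use.

```rust
/// Parses a comma-separated string of numbers, e.g. "1, 2, 3", into floats.
/// Returns an error if any piece is not a valid number.
pub fn parse_floats(s: &str) -> Result<Vec<f32>, std::num::ParseFloatError> {
    s.split(',').map(|tok| tok.trim().parse::<f32>()).collect()
}

// Assumed usage with burn (not compiled here, signature may differ by version):
//     let values = parse_floats("1, 2, 3, 4, 5, 6")?;
//     let tensor = Tensor::<B, 1>::from_floats(values.as_slice(), &device);
```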

Modern techniques involve tokenization, where strings (e.g., sentences) are split into smaller units (e.g., words, subwords, or characters) called tokens, and these tokens are mapped to unique integers using a vocabulary. See for example the tokenizer in the text classification example.
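
To make the token-to-integer mapping concrete, here is a toy word-level sketch. It is not the tokenizer used in burn's text classification example. `Vocab`, `build`, and `encode` are hypothetical names, and splitting on whitespace with lowercasing is a simplifying assumption.

```rust
use std::collections::HashMap;

/// Toy vocabulary mapping whitespace-separated words to integer ids.
/// Id 0 is reserved for words that were not seen when building the vocabulary.
pub struct Vocab {
    ids: HashMap<String, usize>,
}

impl Vocab {
    pub const UNK: usize = 0;

    /// Builds the vocabulary from a corpus, assigning ids in order of first appearance.
    pub fn build(corpus: &[&str]) -> Self {
        let mut ids = HashMap::new();
        for sentence in corpus {
            for word in sentence.split_whitespace() {
                let next = ids.len() + 1; // ids start at 1; 0 is UNK
                ids.entry(word.to_lowercase()).or_insert(next);
            }
        }
        Self { ids }
    }

    /// Maps each word of `text` to its id, falling back to UNK for unseen words.
    pub fn encode(&self, text: &str) -> Vec<usize> {
        text.split_whitespace()
            .map(|w| self.ids.get(&w.to_lowercase()).copied().unwrap_or(Self::UNK))
            .collect()
    }
}
```

The integer ids returned by `encode` are what would then be turned into an integer tensor and fed to an embedding layer.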

1 reply

wangjiawen2013 Jan 23, 2025
Author

I mean it's impractical to define a struct with that many (1000) fields as the item type of the dataset. As you can see from burn's example: https://github.com/tracel-ai/burn/blob/main/examples/custom-csv-dataset/src/dataset.rs

/// Diabetes patient record.
/// For each field, we manually specify the expected header name for serde as all names
/// are capitalized and some field names are not very informative.
#[derive(Deserialize, Serialize, Debug, Clone)]
pub struct DiabetesPatient {
    /// Age in years
    #[serde(rename = "AGE")]
    pub age: u8,

    /// Sex categorical label
    #[serde(rename = "SEX")]
    pub sex: u8,

    /// Body mass index
    #[serde(rename = "BMI")]
    pub bmi: f32,

    /// Average blood pressure
    #[serde(rename = "BP")]
    pub bp: f32,

    /// S1: total serum cholesterol
    #[serde(rename = "S1")]
    pub tc: u16,

    /// S2: low-density lipoproteins
    #[serde(rename = "S2")]
    pub ldl: f32,

    /// S3: high-density lipoproteins
    #[serde(rename = "S3")]
    pub hdl: f32,

    /// S4: total cholesterol
    #[serde(rename = "S4")]
    pub tch: f32,

    /// S5: possibly log of serum triglycerides level
    #[serde(rename = "S5")]
    pub ltg: f32,

    /// S6: blood sugar level
    #[serde(rename = "S6")]
    pub glu: u8,

    /// Y: quantitative measure of disease progression one year after baseline
    #[serde(rename = "Y")]
    pub response: u16,
}

Although from_csv parses the csv for us, we still have to specify the item type of the dataset manually, so a struct must be defined before the dataset can be used. An easier way to specify the data type is needed when the dataset has this many fields/columns.
