diff --git a/argilla/docs/community/images/annotation_screen.png b/argilla/docs/community/images/annotation_screen.png
new file mode 100644
index 0000000000..0ee2e526b7
Binary files /dev/null and b/argilla/docs/community/images/annotation_screen.png differ
diff --git a/argilla/docs/community/images/argilla_ds_list.png b/argilla/docs/community/images/argilla_ds_list.png
new file mode 100644
index 0000000000..8aaa61d649
Binary files /dev/null and b/argilla/docs/community/images/argilla_ds_list.png differ
diff --git a/argilla/docs/community/images/argilla_ds_list_settings.png b/argilla/docs/community/images/argilla_ds_list_settings.png
new file mode 100644
index 0000000000..595f17bf02
Binary files /dev/null and b/argilla/docs/community/images/argilla_ds_list_settings.png differ
diff --git a/argilla/docs/community/images/argilla_ds_settings.png b/argilla/docs/community/images/argilla_ds_settings.png
new file mode 100644
index 0000000000..1f33f6993f
Binary files /dev/null and b/argilla/docs/community/images/argilla_ds_settings.png differ
diff --git a/argilla/docs/community/images/autotrain_screen1.png b/argilla/docs/community/images/autotrain_screen1.png
new file mode 100644
index 0000000000..94e0c2a1c6
Binary files /dev/null and b/argilla/docs/community/images/autotrain_screen1.png differ
diff --git a/argilla/docs/community/images/autotrain_screen2.png b/argilla/docs/community/images/autotrain_screen2.png
new file mode 100644
index 0000000000..fa0e5b3fdd
Binary files /dev/null and b/argilla/docs/community/images/autotrain_screen2.png differ
diff --git a/argilla/docs/community/images/autotrain_ui.png b/argilla/docs/community/images/autotrain_ui.png
new file mode 100644
index 0000000000..c3b3040808
Binary files /dev/null and b/argilla/docs/community/images/autotrain_ui.png differ
diff --git a/argilla/docs/community/sample_publications.csv b/argilla/docs/community/sample_publications.csv
new file mode 100644
index 0000000000..f3819b1b30
--- /dev/null
+++ b/argilla/docs/community/sample_publications.csv
@@ -0,0 +1,150 @@
+publication_number,sequence_id,tokens
+US-4444749-A,0,A shampoo comprising an aqueous solution of an anionic detergent and an effective amount of the reaction product resulting from the reaction in an aqueous medium of (i) a polymer of maleic anhydride and an ethylenically unsaturated monomer with (ii) a primary-tertiary polyamine or a secondary-tertiary polyamine.
+US-4444749-A,1,A shampoo comprising an aqueous solution of an amphoteric detergent and an effective amount of the reaction product resulting from the reaction in an aqueous medium of (i) a polymer of maleic anhydride and an ethylenically unsaturated monomer with (ii) a primary-tertiary polyamine or a secondary tertiary polyamine.
+US-4444749-A,2,"A shampoo comprising an aqueous solution of a polymer having repeating units of the formula ##STR3## wherein R and R' each independently represent a member selected from the group consisting of hydrogen, --OCH 3 , --OCH 2 CH 3 and phenyl with the proviso that one of R and R' is hydrogen; R 1 represents hydrogen or lower alkyl having 1-4 carbon atoms; R 2 represents lower alkyl having 1-4 carbon atoms; R 4 is alkylene containing 2-6 carbon atoms; R 3 is selected from the group consisting of lower alkyl containing 1-6 carbon atoms and --R 4 --N(R 2 ) 2 wherein R 2 and R 4 have the meanings given above; the molar ratio of p/q ranging between 1:1 to 1:0.7; and n is 2-10; said polymer being present in said composition in an amount of 0.5-10 percent by weight thereof, said composition having a pH of 3.5 to 10; and an anionic detergent."
+US-4444749-A,3,"A shampoo comprising an aqueous solution of a polymer having repeating units of the formula ##STR4## wherein R and R' each independently represent a member selected from the group consisting of hydrogen, --OCH 3 , --OCH 2 CH 3 and phenyl with the proviso that one of R and R' is hydrogen; R 1 represents hydrogen or lower alkyl having 1-4 carbon atoms; R 2 represents lower alkyl having 1-4 carbon atoms; R 4 is alkylene containing 2-6 carbon atoms; R 3 is selected from the group consisting of lower alkyl containing 1-6 carbon atoms and --R 4 --N(R 2 ) 2 wherein R 2 and R 4 have the meanings given above; the molar ratio of p/q ranging between 1:1 to 1:0.7; and n is 2-10; said polymer being present in said composition in an amount of 0.5-10 percent by weight thereof, said composition having a pH of 3.5 to 10; and an amphoteric detergent." +US-4135965-A,0,"The method of heat treating a mixture of solids contained within a liquid by contacting the mixture with a flow of treatment gas into a chamber having side walls extending around the central space within the chamber, the method including the concurrently performed steps of: (a) forming a film of the mixture and running the mixture down the side walls while blowing the treatment gas axially upwardly into the central space between the side walls to contact the film of mixture and form a partially treated intermediate product; (b) collecting the intermediate product running down the side walls and spraying the intermediate product downwardly into the central space in contact with said flow of treatment gas as an atomized spray having a pattern shaped to occupy said central space within the chamber while being maintained out of contact with said film running down the side walls to form a treated product; (c) establishing a high agglomerating potential emanating axially from the center of said spray pattern to agglomerate solids therein; (d) and collecting said treated product from said central space 
separately from said film." +US-4135965-A,1,"The method as claimed in claim 1, wherein said solids include combustible particles and wherein the method further includes introducing a flame within said central space in the path of said treatment gas and said atomized spray to burn said combustible particles and heat said film to increase evaporation of the liquid therein." +US-4135965-A,2,"The method as claimed in claim 2, wherein said spray is axially downwardly directed in counterflow against said treatment gas." +US-4135965-A,3,"Apparatus for heat treating a mixture of solids contained within a liquid by contacting the mixture with a flow of treatment gas, the apparatus comprising: (a) a treatment chamber having side walls including downwardly diverging side wall portions surrounding a central axis of the chamber; (b) a first mixture inlet at the top of said side wall portion and including annular means for forming a film of the mixture on the diverging side wall portions, the film running down the side walls; (c) outlet means connected with the side walls and operative to collect an intermediate enriched mixture descending from the side walls; (d) a second mixture inlet at the top of said side wall portions disposed within said annular film-forming means and including spraying means having a spraying pattern directed axially downwardly in the chamber and shaped to remain out of contact with the side portions and walls; (e) means operative to introduce said intermediate mixture from said outlet means into said second inlet; (f) means for blowing said treatment gas upwardly into said chamber into contact with said film and said spray pattern; and (g) hopper means in the chamber below the spray pattern and disposed thereopposite and operative to separate sprayed solids from said film and collect the solids and conduct them outside the chamber." 
+US-4135965-A,4,"Apparatus as claimed in claim 4, wherein said means for blowing gas has an outlet axially located in the chamber, and wherein said apparatus further includes ignition burner means disposed coaxially with said blowing means and opposite said spraying means in the path of said solids."
+US-4135965-A,5,"Apparatus as claimed in claim 4, and further including second burner means located adjacent to said ignition burner means and connected to receive and burn a part of said intermediate enriched mixture within said chamber."
+US-4135965-A,6,"Apparatus as claimed in claim 6, wherein said ignition burner and said second burner are coaxially located in the outlet of the means for blowing gas."
+US-4135965-A,7,"Apparatus as claimed in claim 4, and further including high-voltage electrode means extending downwardly into the chamber beneath said inlets and terminating above said hopper means and said gas blowing means, and operative to agglomerate said solids."
+US-4135965-A,8,"Apparatus as claimed in claim 4, wherein said gas blowing means comprises a pipe extending upwardly into said chamber through the hopper means, the pipe being located concentric with the axis of the chamber to avoid rotation of the gas flowing thereinto."
+US-11341283-B2,0,"A method for obfuscating a hardware intellectual property (IP) design by locking the hardware IP design based at least in part on a plurality of key-bits, the method comprising: performing a plurality of obfuscation iterations for the hardware IP design, wherein each of the plurality of obfuscation iterations comprises: generating a key vulnerability matrix for a locked version of the hardware IP design and a plurality of attacks that comprises for each attack of the plurality of attacks: applying the attack to the locked version of the hardware IP design; determining whether the attack successfully extracted a correct key value for each key-bit of the plurality of key-bits; and generating a vector for the key vulnerability matrix, the vector comprising a value for each key-bit of the plurality of key-bits identifying whether the attack successfully extracted the correct key value for the key-bit; and for each key-bit of the plurality of key-bits: determining whether the key-bit is vulnerable to at least one attack in the plurality of attacks based at least in part on the values found in the key vulnerability matrix for the key-bit; and responsive to the key-bit being vulnerable to the at least one attack in the plurality of attacks: de-obfuscating a key-gate used for the key-bit within the locked version of the hardware IP design; removing the key-gate from the locked version of the hardware IP design; identifying a set of design modification solutions to mitigate the at least one attack, wherein each design modification solution identifies a location within the hardware IP design and a key-gate type; selecting a selected design modification solution from the set of design modification solutions; and inserting the key-gate type to be used for the key-bit at the location identified by the selected design modification solution into the hardware IP design." 
+US-11341283-B2,1,"The method of claim 1 , wherein identifying the set of design modification solutions comprises: sorting the plurality of attacks into a sorted list of attacks based at least in part on an attack severity metric for each attack of the plurality of attacks from a highest attack severity metric to a lowest attack severity metric; and performing a plurality of solution iterations, wherein each of the plurality of solution iterations comprises: selecting an attack from the sorted list of attacks, the selected attack having a next highest attack severity metric of the attacks remaining to be processed in the sorted list of attacks; selecting one or more attack mitigation rules for the selected attack; appending the one or more attack mitigation rules to a set of attack mitigation rules; generating one or more possible design modification solutions based at least in part on the set of attack mitigation rules; responsive to generating the one or more possible design modification solutions: setting the one or more possible design modification solutions as the set of design modification solutions; and responsive to at least one attack remaining to be processed in the sorted list of attacks, performing a next solution iteration of the plurality of solution iterations; and responsive to not being able to generate the one or more possible design modification solutions, exiting from performing the plurality of solution iterations." 
+US-11341283-B2,2,"The method of claim 2 further comprising, for each attack of the plurality of attacks: generating an accuracy for the attack for each of the plurality of obfuscation iterations performed, the accuracy based at least in part on a number of key-bits of the plurality of key-bits for which the attack successfully extracted the correct key value for the obfuscation iteration; and generating the attack severity metric for the attack based at least in part on a weighting of the accuracy for the attack for each of the plurality of obfuscation iterations performed." +US-11341283-B2,3,"The method of claim 2 , wherein generating the one or more possible design modification solutions based at least in part on the set of attack mitigation rules comprises generating the one or more possible design modification solutions by inputting the set of attack mitigation rules to a model configured to perform structural and functional analysis to interpret the set of attack mitigation rules, wherein the set of attack mitigation rules comprises one or more rules used by the model to identify the key-gate type for each possible design modification solution of the one or more possible design modification solutions and one or more rules used by the model to identify the location where to insert the key-gate type for each possible design modification solution of the one or more possible design modification solutions." +US-11341283-B2,4,"The method of claim 4 , wherein the set of attack mitigation rules are configured in a grammar comprising a set of conditions and operators configured to support statements used to specify conditions for identifying the key-gate type and the location." 
+US-11341283-B2,5,"The method of claim 1 further comprising: for each attack in a set of attacks: applying the attack to a benchmark IP design; and after applying the attack to the benchmark IP design: determining whether an extracted key from applying the attack matches at least a portion of an original key used in locking the benchmark IP design; responsive to the extracted key matching at least the portion of the original key, placing the attack into a first attack group; and responsive to the extracted key not matching at least the portion of the original key, placing the attack into a second attack group; applying each attack in the first attack group to the benchmark IP design sequentially, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the first attack group is applied to the benchmark IP design; applying each attack in the second attack group to the benchmark IP design in parallel after applying each attack in the first attack group, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the second attack group is applied to the benchmark IP design to produce a de-obfuscated IP design for a set of de-obfuscated IP designs; and generating the plurality of attacks comprising the attacks found in the first attack group and the attack found in the second attack group that produces the de-obfuscated IP design in the set of de-obfuscated IP designs having a highest number of extracted key-bits." +US-11341283-B2,6,The method of claim 6 further comprising sorting the attacks of the first attack group according to a number of correct key-bits found in the extracted key associated with each respective attack of the first attack group. 
+US-11341283-B2,7,"An apparatus for obfuscating a hardware intellectual property (IP) design by locking the hardware IP design based at least in part on a plurality of key-bits, the apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus to at least: perform a plurality of obfuscation iterations for the hardware IP design, wherein each of the plurality of obfuscation iterations comprises: generate a key vulnerability matrix for a locked version of the hardware IP design and a plurality of attacks that comprises for each attack of the plurality of attacks: apply the attack to the locked version of the hardware IP design; determine whether the attack successfully extracted a correct key value for each key-bit of the plurality of key-bits; and generate a vector for the key vulnerability matrix, the vector comprising a value for each key-bit of the plurality of key-bits identifying whether the attack successfully extracted the correct key value for the key-bit; and for each key-bit of the plurality of key-bits: determine whether the key-bit is vulnerable to at least one attack in the plurality of attacks based at least in part on the values found in the key vulnerability matrix for the key-bit; and responsive to the key-bit being vulnerable to the at least one attack in the plurality of attacks: de-obfuscate a key-gate used for the key-bit within the locked version of the hardware IP design; remove the key-gate from the locked version of the hardware IP design; identify a set of design modification solutions to mitigate the at least one attack, wherein each design modification solution identifies a location within the hardware IP design and a key-gate type; select a selected design modification solution from the set of design modification solutions; and insert the key-gate type to be used for the key-bit 
at the location identified by the selected design modification solution into the hardware IP design." +US-11341283-B2,8,"The apparatus of claim 8 , wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to identify the set of design modification solutions by: sorting the plurality of attacks into a sorted list of attacks based at least in part on an attack severity metric for each attack of the plurality of attacks from a highest attack severity metric to a lowest attack severity metric; and performing a plurality of solution iterations, wherein each of the plurality of solution iterations comprises: selecting an attack from the sorted list of attacks, the selected attack having a next highest attack severity metric of the attacks remaining to be processed in the sorted list of attacks; selecting one or more attack mitigation rules for the selected attack; appending the one or more attack mitigation rules to a set of attack mitigation rules; generating one or more possible design modification solutions based at least in part on the set of attack mitigation rules; responsive to generating the one or more possible design modification solutions: setting the one or more possible design modification solutions as the set of design modification solutions; and responsive to at least one attack remaining to be processed in the sorted list of attacks, performing a next solution iteration of the plurality of solution iterations; and responsive to not being able to generate the one or more possible design modification solutions, exiting from performing the plurality of solution iterations." 
+US-11341283-B2,9,"The apparatus of claim 9 , wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to, for each attack of the plurality of attacks: generate an accuracy for the attack for each of the plurality of obfuscation iterations performed, the accuracy based at least in part on a number of key-bits of the plurality of key-bits for which the attack successfully extracted the correct key value for the obfuscation iteration; and generate the attack severity metric for the attack based at least in part on a weighting of the accuracy for the attack for each of the plurality of obfuscation iterations performed." +US-11341283-B2,10,"The apparatus of claim 9 , wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to generate the one or more possible design modification solutions based at least in part on the set of attack mitigation rules by generating the one or more possible design modification solutions by inputting the set of attack mitigation rules to a model configured to perform structural and functional analysis to interpret the set of attack mitigation rules, wherein the set of attack mitigation rules comprises one or more rules used by the model to identify the key-gate type for each possible design modification solution of the one or more possible design modification solutions and one or more rules used by the model to identify the location where to insert the key-gate type for each possible design modification solution of the one or more possible design modification solutions." +US-11341283-B2,11,"The apparatus of claim 11 , wherein the set of attack mitigation rules are configured in a grammar comprising a set of conditions and operators configured to support statements used to specify conditions for identifying the key-gate type and the location." 
+US-11341283-B2,12,"The apparatus of claim 8 , wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to: for each attack in a set of attacks: apply the attack to a benchmark IP design; and after applying the attack to the benchmark IP design: determine whether an extracted key from applying the attack matches at least a portion of an original key used in locking the benchmark IP design; responsive to the extracted key matching at least the portion of the original key, place the attack into a first attack group; and responsive to the extracted key not matching at least the portion of the original key, place the attack into a second attack group; apply each attack in the first attack group to the benchmark IP design sequentially, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the first attack group is applied to the benchmark IP design; apply each attack in the second attack group to the benchmark IP design in parallel after applying each attack in the first attack group, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the second attack group is applied to the benchmark IP design to produce a de-obfuscated IP design for a set of de-obfuscated IP designs; and generate the plurality of attacks comprising the attacks found in the first attack group and the attack found in the second attack group that produces the de-obfuscated IP design in the set of de-obfuscated IP designs having a highest number of extracted key-bits." +US-11341283-B2,13,"The apparatus of claim 13 , wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus to sort the attacks of the first attack group according to a number of correct key-bits found in the extracted key associated with each respective attack of the first attack group." 
+US-11341283-B2,14,"A non-transitory computer storage medium comprising instructions for obfuscating a hardware intellectual property (IP) design by locking the hardware IP design based at least in part on a plurality of key-bits, the instructions being configured to cause one or more processors to at least perform operations configured to: perform a plurality of obfuscation iterations for the hardware IP design, wherein each of the plurality of obfuscation iterations comprises: generate a key vulnerability matrix for a locked version of the hardware IP design and a plurality of attacks that comprises for each attack of the plurality of attacks: apply the attack to the locked version of the hardware IP design; determine whether the attack successfully extracted a correct key value for each key-bit of the plurality of key-bits; and generate a vector for the key vulnerability matrix, the vector comprising a value for each key-bit of the plurality of key-bits identifying whether the attack successfully extracted the correct key value for the key-bit; and for each key-bit of the plurality of key-bits: determine whether the key-bit is vulnerable to at least one attack in the plurality of attacks based at least in part on the values found in the key vulnerability matrix for the key-bit; and responsive to the key-bit being vulnerable to the at least one attack in the plurality of attacks: de-obfuscate a key-gate used for the key-bit within the locked version of the hardware IP design; remove the key-gate from the locked version of the hardware IP design; identify a set of design modification solutions to mitigate the at least one attack, wherein each design modification solution identifies a location within the hardware IP design and a key-gate type; select a selected design modification solution from the set of design modification solutions; and insert the key-gate type to be used for the key-bit at the location identified by the selected design modification solution 
into the hardware IP design." +US-11341283-B2,15,"The non-transitory computer storage medium of claim 15 , wherein the instructions are further configured to cause the one or more processors to at least perform operations configured to identify the set of design modification solutions by: sorting the plurality of attacks into a sorted list of attacks based at least in part on an attack severity metric for each attack of the plurality of attacks from a highest attack severity metric to a lowest attack severity metric; and performing a plurality of solution iterations, wherein each of the plurality of solution iterations comprises: selecting an attack from the sorted list of attacks, the selected attack having a next highest attack severity metric of the attacks remaining to be processed in the sorted list of attacks; selecting one or more attack mitigation rules for the selected attack; appending the one or more attack mitigation rules to a set of attack mitigation rules; generating one or more possible design modification solutions based at least in part on the set of attack mitigation rules; responsive to generating the one or more possible design modification solutions: setting the one or more possible design modification solutions as the set of design modification solutions; and responsive to at least one attack remaining to be processed in the sorted list of attacks, performing a next solution iteration of the plurality of solution iterations; and responsive to not being able to generate the one or more possible design modification solutions, exiting from performing the plurality of solution iterations." 
+US-11341283-B2,16,"The non-transitory computer storage medium of claim 16 , wherein the instructions are further configured to cause the one or more processors to at least perform operations configured to, for each attack of the plurality of attacks: generate an accuracy for the attack for each of the plurality of obfuscation iterations performed, the accuracy based at least in part on a number of key-bits of the plurality of key-bits for which the attack successfully extracted the correct key value for the obfuscation iteration; and generate the attack severity metric for the attack based at least in part on a weighting of the accuracy for the attack for each of the plurality of obfuscation iterations performed." +US-11341283-B2,17,"The non-transitory computer storage medium of claim 16 , wherein the instructions are further configured to cause the one or more processors to at least perform operations configured to generate the one or more possible design modification solutions based at least in part on the set of attack mitigation rules by generating the one or more possible design modification solutions by inputting the set of attack mitigation rules to a model configured to perform structural and functional analysis to interpret the set of attack mitigation rules, wherein the set of attack mitigation rules comprises one or more rules used by the model to identify the key-gate type for each possible design modification solution of the one or more possible design modification solutions and one or more rules used by the model to identify the location where to insert the key-gate type for each possible design modification solution of the one or more possible design modification solutions." 
+US-11341283-B2,18,"The non-transitory computer storage medium of claim 18 , wherein the set of attack mitigation rules are configured in a grammar comprising a set of conditions and operators configured to support statements used to specify conditions for identifying the key-gate type and the location." +US-11341283-B2,19,"The non-transitory computer storage medium of claim 15 , wherein the instructions are further configured to cause the one or more processors to at least perform operations configured to: for each attack in a set of attacks: apply the attack to a benchmark IP design; and after applying the attack to the benchmark IP design: determine whether an extracted key from applying the attack matches at least a portion of an original key used in locking the benchmark IP design; responsive to the extracted key matching at least the portion of the original key, place the attack into a first attack group; and responsive to the extracted key not matching at least the portion of the original key, place the attack into a second attack group; apply each attack in the first attack group to the benchmark IP design sequentially, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the first attack group is applied to the benchmark IP design; apply each attack in the second attack group to the benchmark IP design in parallel after applying each attack in the first attack group, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the second attack group is applied to the benchmark IP design to produce a de-obfuscated IP design for a set of de-obfuscated IP designs; and generate the plurality of attacks comprising the attacks found in the first attack group and the attack found in the second attack group that produces the de-obfuscated IP design in the set of de-obfuscated IP designs having a highest number of extracted key-bits." 
+US-11341283-B2,20,"The non-transitory computer storage medium of claim 20 , wherein the instructions are further configured to cause the one or more processors to at least perform operations configured to sort the attacks of the first attack group according to a number of correct key-bits found in the extracted key associated with each respective attack of the first attack group." +US-2022132822-A1,0,"A method of identifying an insect infestation of a stored product by detecting one or more target volatile organic compounds (VOCs) within a target fluid flow, the method comprising: heating, via a device comprising a plurality of VOC sensors, at least one of the plurality of VOC sensors to at least a first operating temperature; contacting the one or more VOC sensors with the target fluid flow; determining a set of conductance change values (ΔK i ) corresponding to each of the one or more VOC sensors contacted with the target fluid flow; and determining a gas component concentration ([X] n ) for one or more of the target VOCs within the target fluid flow based on the set of conductance change values." +US-2022132822-A1,1,"The method of claim 1 , wherein each VOC sensor of the plurality of VOC sensors includes: a substrate having a first and second side; a resistive heater circuit formed on the first side of the substrate; a sensing circuit formed on the second side of the substrate; and a chemically sensitive film formed over the sensing circuit on the second side of the substrate." +US-2022132822-A1,2,"The method of claim 1 , wherein at least one of the plurality of VOC sensors is configured to detect the presence of an egg-specific VOC." 
+US-2022132822-A1,3,"The method of claim 1 , wherein the method further comprises: measuring a signal conductance for the one or more VOC sensors after contacting the one or more VOC sensors with the target fluid flow; wherein the set of conductance change values (ΔK i ) is determined based on the difference between the signal conductance for each of the one or more VOC sensors contacted with the target fluid flow and a baseline conductance of each of the corresponding VOC sensors." +US-2022132822-A1,4,"The method of claim 4 , wherein the baseline conductance for the one or more VOC sensors is measured while the one or more VOC sensors are in an atmosphere absent of any target VOCs." +US-2022132822-A1,5,"The method of claim 5 , wherein the method further comprises: adjusting the baseline conductance of one or more of the VOC sensors after being contacted with at least one target VOC to match the baseline conductance of the corresponding VOC sensor prior to contact with the at least one target VOC, wherein the baseline conductance is adjusted by heating one or more of the VOC sensors to at least a second operating temperature." +US-2022132822-A1,6,"The method of claim 4 , wherein the method further comprises: contacting one or more of the plurality of VOC sensors with a sample fluid flow, the sample fluid flow being absent of any target VOCs; and measuring the baseline conductance for the one or more VOC sensors." +US-2022132822-A1,7,"The method of claim 1 , wherein the method further comprises: determining one or more specific net conductance values for one or more of the VOC sensors, wherein each specific net conductance value corresponds to one of the target VOCs." 
+US-2022132822-A1,8,"The method of claim 8 , wherein each specific net conductance value corresponding to a target VOC is determined by: contacting the one or more VOC sensors with a control fluid flow having a known concentration of the target VOC; measuring a test conductance for each of the one or more VOC sensors; and for each of the one or more VOC sensors, calculating a specific net conductance value based on the measured test conductance of the VOC sensor and the known concentration of the target VOC within the control fluid flow." +US-2022132822-A1,9,"The method of claim 9 , wherein the method further comprises: determining a plurality of specific net conductance values for one or more of the VOC sensors, wherein each of the specific net conductance values for each of the VOC sensors corresponds to a different target VOC." +US-2022132822-A1,10,"The method of claim 8 , wherein the gas component concentration ([X] n ) for the one or more target VOCs within the target fluid flow is determined based on the set of conductance change values and the one or more specific net conductance values for each of the one or more of VOC sensors." +US-2022132822-A1,11,"The method of claim 1 , wherein the first operating temperature is between about 180° C. and about 400° C." +US-2022132822-A1,12,"The method of claim 1 , wherein the target fluid flow is an air sample taken from within a proximity to the stored product being evaluated." 
+US-2022132822-A1,13,"A system for identifying an insect infestation of a stored product, the system comprising: a testing chamber enclosing a sensor array, wherein the sensor array includes a plurality of VOC sensors and at least one VOC sensor of the plurality of VOC sensors is configured to detect the presence of an egg-specific VOC; an air transfer unit configured to retrieve a fluid flow and deliver the fluid flow to the testing chamber; and a controller operatively connected to the air transfer unit and the sensor array, wherein the controller is configured to: operate the air transfer unit to retrieve the fluid flow from and deliver the fluid flow to the testing chamber, wherein one or more of the plurality of VOC sensors are in fluid contact with the fluid flow; operate the sensor array to measure a conductance for one or more of the plurality of VOC sensors; determine a set of conductance change values corresponding to each of the one or more VOC sensors; and determine a gas component concentration for one or more target VOCs within the fluid flow based on the set of conductance change values." 
+US-2022132822-A1,14,"The system of claim 14 , wherein at least one of the one or more target VOCs within the fluid flow is selected from a group consisting of: 11,13-hexadecadienal; 4,8-dimethyldecanal; (Z,Z)-3,6-(11R)-Dodecadien-11-olide; (Z,Z)-3,6-Dodecadienolide; (Z,Z)-5,8-(11R)-Tetradecadien-13-olide; (Z)-5-Tetradecen-13-olide; (R)-(Z)-14-Methyl-8-hexadecenal; (R)-(E)-14-Methyl-8-hexadecen-al; γ-ethyl-γ-butyrolactone; (Z,E)-9,12-Tetradecadienyl acetate; (Z,E)-9,12-Tetra-decadien-1-ol; (Z,E)-9,12-Tetradecadienal; (Z)-9-Tetradecenyl acetate; (Z)-11-Hexa-decenyl acetate; (2S,3R,1′S)-2,3-Dihydro-3,5-dimethyl-2-ethyl-6(1-methyl-2-oxobutyl)-4H-pyran-4-one; (2S,3R,1′R)-2,3-Dihydro-3,5-dimethyl-2-ethyl-6(1-methyl-2-oxobutyl)-4H-pyran-4-one; (4S,6S,7S)-7-Hydroxy-4,6-dimethylnonan-3-one; (2S,3S)-2,6-Diethyl-3,5-dimethyl-3,4-dihydro-2H-pyran; 2-Palmitoyl-cyclohexane-1,3-dione; and 2-Oleoyl-cyclo-hexane-1,3-dione." +US-2022148681-A1,0,"1 . A method for identifying one or more antigens from one or more cells of a subject that are likely to be presented on a surface of the cells, the method comprising the steps of: (a) obtaining data representing peptide sequences of each of a set of antigens, wherein said data further comprises a value indicating a likelihood of a presentation hotspot for one or more k-mer blocks associated with the peptide sequences; (b) determining, using a neural network model, a set of presentation likelihoods for the set of antigens, each presentation likelihood in the set representing the likelihood that a corresponding antigen is presented by one or more MHC alleles on the surface of the cells of the subject, the neural network model comprising: (i) two or more layers comprising a first layer and a second layer, each layer comprising one or more nodes, wherein said nodes comprise a memory location for one or more input values; (ii) a plurality of connections between nodes of said first layer and one or more nodes of said second layer; (iii) 
optimized parameters stored in memory locations, wherein the optimized parameters transform input values of nodes of the first layer into input values for nodes of the second layer connected to the nodes of the first layer, (iv) wherein the optimized parameters are generated using a training data set comprising: (A) training peptide sequences or data derived from training peptide sequences; (B) at least one MHC allele associated with the training peptide sequences; (C) a value indicating a likelihood of a presentation hotspot for one or more k-mer blocks of a plurality of k-mer blocks associated with the training peptide sequences, and (D) for each of one or more of the training peptide sequences, a label indicating whether the training peptide was presented by the at least one MHC allele; and (c) wherein said determining comprises: forward feeding the data representing peptide sequences of each of a set of antigens, using a computer processor, through nodes of the first layer and the second layer of the neural network model, said forward feeding comprising transforming the data as they are fed from nodes of the first layer to nodes of the second layer using the optimized parameters; generating, using a computer processor, the set of presentation likelihoods for the set of antigens from the transformed data; (d) selecting a subset of the set of antigens based on the set of presentation likelihoods to generate a set of selected antigens; and (e) returning the set of selected antigens." +US-2022148681-A1,1,"The method of claim 1 , wherein forward feeding the data representing peptide sequences further comprises inputting into the neural network model the value indicating the likelihood of a presentation hotspot for one or more k-mer blocks associated with the peptide sequences." 
+US-2022148681-A1,2,"The method of claim 1 , wherein a larger value of a parameter of the optimized parameters indicates a greater likelihood that a corresponding k-mer block gives rise to a presented peptide." +US-2022148681-A1,3,"The method of claim 1 , wherein a smaller value of a parameter of the optimized parameters indicates a smaller likelihood that a corresponding k-mer block gives rise to a presented peptide." +US-2022148681-A1,4,"The method of claim 1 , wherein at least one of the k-mer blocks corresponds to a proteomic location." +US-2022148681-A1,5,"The method of claim 5 , wherein the proteomic location comprises a block of n adjacent peptides, wherein n represents a hyperparameter of the neural network model." +US-2022148681-A1,6,"The method of claim 1 , wherein the peptide sequences or training peptide sequences comprise sequences having lengths between 8-15 amino acids." +US-2022148681-A1,7,"The method of claim 1 , wherein generating the set of presentation likelihoods for the set of antigens comprises: generating a dependency score for each of the one or more class I MHC alleles, the dependency scores indicating whether the class I MHC alleles will present the antigen based on the particular amino acids at the particular positions of the peptide sequence." +US-2022148681-A1,8,"The method of claim 8 , wherein generating the set of presentation likelihoods for the set of antigens further comprises: transforming the dependency scores to generate a corresponding per-allele likelihood for each class I MHC allele indicating a likelihood that the corresponding class I MHC allele will present the corresponding antigen; and combining the per-allele likelihoods to generate the presentation likelihood of the antigen." +US-2022148681-A1,9,"The method of claim 9 , wherein the transforming the dependency scores models the presentation of the antigen as mutually exclusive across the one or more class I MHC alleles." 
+US-2022148681-A1,10,"The method of claim 10 , generating the set of presentation likelihoods for the set of antigens further comprises: transforming a combination of the dependency scores to generate the presentation likelihood, wherein transforming the combination of the dependency scores models the presentation of the antigen as interfering between the one or more class I MHC alleles." +US-2022148681-A1,11,"The method of claim 8 , wherein the set of presentation likelihoods are further identified by at least one or more allele noninteracting features, and further comprising: applying the neural network model to the allele noninteracting features to generate a dependency score for the allele noninteracting features indicating whether the peptide sequence of the corresponding antigen will be presented based on the allele noninteracting features." +US-2022148681-A1,12,"The method of claim 12 , further comprising: combining the dependency score for each class I MHC allele in the one or more class I MHC alleles with the dependency score for the allele noninteracting feature; and transforming the combined dependency scores for each class I MHC allele to generate a per-allele likelihood for each class I MHC allele indicating a likelihood that the corresponding class I MHC allele will present the corresponding antigen; and combining the per-allele likelihoods to generate the presentation likelihood." +US-2022148681-A1,13,"The method of claim 13 , further comprising: transforming a combination of the dependency scores for each of the class I MHC alleles and the dependency score for the allele noninteracting features to generate the presentation likelihood." +US-2022148681-A1,14,"The method of claim 1 , wherein the one or more class I MHC alleles include two or more class I MHC alleles." 
+US-2022148681-A1,15,"The method of claim 1 , wherein the plurality of samples comprise at least one of: (a) one or more cell lines engineered to express a single MHC class I allele; (b) one or more cell lines engineered to express a plurality of MHC class I alleles; (c) one or more human cell lines obtained or derived from a plurality of patients; (d) fresh or frozen tumor samples obtained from a plurality of patients; and (e) fresh or frozen tissue samples obtained from a plurality of patients." +US-2022148681-A1,16,"The method of claim 1 , wherein the set of presentation likelihoods are further identified by at least expression levels of the one or more class I MHC alleles in the subject, as measured by RNA-seq or mass spectrometry." +US-2022148681-A1,17,"The method of claim 1 , wherein the set of numerical likelihoods are further identified by features comprising at least one of: (a) the C-terminal sequences flanking the antigen encoded peptide sequence within its source protein sequence; and (b) the N-terminal sequences flanking the antigen encoded peptide sequence within its source protein sequence." +US-2022148681-A1,18,"The method of claim 1 , further comprising generating an output for constructing a personalized cancer vaccine from the set of selected antigens." 
+US-2022148681-A1,19,"A computer system comprising: a computer processor; a memory storing computer program instructions that when executed by the computer processor cause the computer processor to: (a) obtain data representing peptide sequences of each of a set of antigens wherein said data further comprises a value indicating a likelihood of a presentation hotspot for one or more k-mer blocks associated with the peptide sequences; (b) determine, using a neural network model, a set of presentation likelihoods for the set of neoantigens, each presentation likelihood in the set representing the likelihood that a corresponding neoantigen is presented by one or more MHC alleles on the surface of the cells of the subject, the neural network model comprising: (i) two or more layers comprising a first layer and a second layer, each layer comprising one or more nodes, wherein said nodes comprise a memory location for one or more input values; (ii) a plurality of connections between nodes of said first layer and one or more nodes of said second layer, (iii) optimized parameters stored in memory locations, wherein the optimized parameters transform input values of nodes of the first layer into input values for nodes of the second layer connected to the nodes of the first layer, (iv) wherein the optimized parameters are generated using a training data set comprising: (A) training peptide sequences or data derived from training peptide sequences; (B) at least one MHC allele associated with the training peptide sequences; (C) a value indicating likelihood of a presentation hotspot for one or more k-mer blocks of a plurality of k-mer blocks associated with the peptide sequences; and (D) for each of one or more of the training peptide sequences, a label indicating whether the training peptide was presented by the at least one MHC allele, (c) wherein said determination of the set of presentation likelihoods comprises: forward feeding the data representing peptide sequences of 
each of a set of antigens, using a computer processor, through nodes of the first layer and the second layer of the neural network model, said forward feeding comprising transforming the data as they are fed from nodes of the first layer to nodes of the second layer using the optimized parameters; generating, using a computer processor, the set of presentation likelihoods for the set of antigens from the transformed data; (d) select a subset of the set of neoantigens based on the set of presentation likelihoods to generate a set of selected neoantigens; and (e) return the set of selected neoantigens." +US-5451986-A,0,"An image formation device comprising: a) a recording medium; b) an ink sheet having a dielectric layer and an ink layer disposed thereon, arranged to travel in a continuous pathway, wherein said dielectric layer is substantially transparent to a portion of the visible electromagnetic spectrum; c) a means for bringing said recording medium and said ink sheet into contact with each other and for transferring portions of said ink layer onto said recording medium thereby leaving bare areas on the sheet; d) a photoconductive powder ink; e) an exposure system; f) a two-component hopper; and g) a positively charged conductive rotating sleeve; wherein said two-component hopper is operable to receive and store a supply of said photoconductive powder ink and further operable to impart a negative charge to said photoconductive powder ink, said conductive sleeve is coupled to said two-component hopper such that said conductive sleeve has access to said photoconductive powder ink within said two-component hopper, said positively charged sleeve picking up particles of the negatively charged powder ink and carrying the powder ink to a transfer station where the exposure system directs radiation toward the sleeve with the ink sheet being position therebetween, said radiation from the exposure system passing through the transparent dielectric layer on the bare areas 
and striking the photoconductive powder ink and changing the negative charge to a positive charge supplied by the positively charged sleeve, the positively charged powder ink thereafter being electrostatically transferred to the ink sheet, untransferred areas on the ink sheet blocking light from the exposure system so that the powder ink on the sleeve remains negatively charged and is not attracted to the sheet; and h) means for fixing the transferred photoconductive powder ink to the bare areas of the ink sheet to thereby provide a regenerated ink sheet." +US-5451986-A,1,The image formation device of claim 1 wherein said means for fixing comprises a heat-roller. +US-5451986-A,2,The image formation device of claim 1 wherein said exposure system is operable to produce light energy corresponding to a portion of the electromagnetic spectrum. +US-5451986-A,3,"The image formation device of claim 1 which further comprises a transparent hollow cylindrical backup roller disposed so as to surround said exposure system, while being in contact with said ink sheet." +US-5451986-A,4,The image formation device of claim 4 wherein said backup roller comprises acrylic rubber. +US-5451986-A,5,The image formation device of claim 5 wherein said backup roller further comprises a silicon coating for smoothness. 
+US-5451986-A,6,"A method of reconditioning ink sheets comprising the steps of: a) transporting an ink sheet having a light transmissive base layer, said base layer having a first side and a second side, an electrically conductive layer disposed on said first side of said base layer, and an ink layer disposed on said second side of said base layer, through a thermal print head wherein a portion of said ink layer is transferred onto a recording medium and whereby bare areas are formed on said ink sheet; b) grounding said conductive layer; c) charging a photoconductive powder ink contained in a two-component hopper with a negative charge; d) impressing a positive bias on a rotating conductive sleeve coupled to said two-component hopper; e) forming a layer of said photoconductive powder ink on said conductive sleeve by means of electrostatic force; f) transporting said ink sheet having bare areas between said conductive sleeve and an opposed exposure system; g) exposing said layer of said photoconductive powder ink on said conductive sleeve to light from said exposure system passing through said ink sheet bare areas so that only a first portion of said photoconductive powder ink which is physically subjacent to said ink sheet bare areas assumes a positive charge; h) adhering said first portion of said photoconductive powder ink to said ink sheet bare areas by attraction of the positively charged ink to the grounded conductive layer; and i) fixing said adhered photoconductive powder ink to said ink sheet." +US-5451986-A,7,"The method of reconditioning ink sheets as claimed in claim 7, further comprising the step of applying an electrical potential to regions of said ink sheet where said ink layer has not been removed by printing, prior to said steps (f), (g) and (h) of exposing, adhering and fixing." 
+US-5451986-A,8,"The method of reconditioning ink sheets as claimed in claim 7, wherein said ink sheets have a number, N, of different color inks, and steps (a) through (h) are repeated N times." +US-5451986-A,9,"The method of reconditioning ink sheets as claimed in claim 7, wherein said electrically conductive layer is comprised of Indium-Tin-Oxide (ITO)." +US-5451986-A,10,"The method of reconditioning ink sheets as claimed in claim 7, wherein said transmissive base layer is comprised of a polyester film approximately 4 microns thick." +US-5451986-A,11,"In an image formation device using a transportable ink sheet having a transparent base layer, an electrically conductive layer on one side of the base layer and a conductive ink layer on an opposite side of the base layer, said device having a print head for transferring a portion of the ink layer to a recording medium whereby bare areas are formed on said ink sheet, the improvement comprising an apparatus for reconditioning the ink sheet for reuse, said apparatus including: means for applying a charge of a first polarity to the conductive layer of the ink sheet; a hopper containing photoconductive ink particles; means for applying a charge of said first polarity to the ink particles in the hopper; a rotating oppositely charged sleeve for picking up the charged ink particles and carrying the charged ink particles to a transfer station; said transfer station including an exposure system for directing radiation toward the sleeve, the ink sheet being conveyed through the transfer station between the exposure system and the sleeve, the radiation from the exposure system passing through the transparent base layer and the bare areas on the ink sheet whereby the ink particles become oppositely charged, the oppositely charged ink particles being attracted to the charged conductive layer on the ink sheet opposite the bare areas whereby the bare areas are covered with ink particles; and fixing means for fixing the 
transferred ink particles on the ink sheet." +US-5451986-A,12,"The apparatus of claim 12 which further comprises: means for contacting the ink layer, prior to the transfer station and applying a uniform potential thereto." +US-5451986-A,13,The apparatus of claim 13 wherein the means for contacting comprises a positively biased roller. +US-11631617-B2,0,"A fin field-effect transistor (FINFET), comprising: fins patterned in a substrate, wherein the fins comprise at least one first fin corresponding to a first FINFET device and at least one second fin corresponding to a second FINFET device; a conformal gate dielectric disposed over the fins; at least one first workfunction-setting metal disposed over the at least one first fin and at least one second workfunction-setting metal disposed over the at least one second fin, wherein the at least one first workfunction-setting metal and the at least one second workfunction-setting metal have the same thickness T; dielectric gates formed over the at least one first workfunction-setting metal, the at least one second workfunction-setting metal and the conformal gate dielectric forming gate stacks of the first FINFET device and the second FINFET device; and source and drains formed in the fins between the gate stacks, wherein the source and drains are separated from the gate stacks by inner spacers." +US-11631617-B2,1,"The FINFET of claim 1 , further comprising: shallow trench isolation STI regions in the substrate in between the fins." +US-11631617-B2,2,"The FINFET of claim 1 , wherein the conformal gate dielectric comprises a high-κ gate dielectric selected from the group consisting of: hafnium oxide (HfO 2 ), lanthanum oxide (La 2 O 3 ), and combinations thereof." +US-11631617-B2,3,"The FINFET of claim 1 , wherein the dielectric gates comprise a material selected from the group consisting of: silicon oxide (SiOx), silicon carbide (SiC), silicon oxycarbide (SiOC), and combinations thereof." 
+US-11631617-B2,4,"The FINFET of claim 1 , wherein the inner spacers comprise a material selected from the group consisting of: SiOx, silicon oxycarbide (SiOC), silicon nitride (SiN), silicon oxynitride (SiON), silicon carbide nitride (SiCN), silicon oxycarbon nitride (SiOCN), and combinations thereof." +US-11631617-B2,5,"The FINFET of claim 1 , wherein the same thickness T is from about 3 nm to about 10 nm." +US-11631617-B2,6,"The FINFET of claim 1 , wherein the source and drains are formed from a doped epitaxial material." +US-11631617-B2,7,"The FINFET of claim 1 , wherein the source and drains are disposed on bottom isolation regions." +US-11631617-B2,8,"The FINFET of claim 8 , wherein the bottom isolation regions comprise silicon oxide (SiOx)." +US-11631617-B2,9,"The FINFET of claim 8 , wherein the bottom isolation regions have a thickness t of from about 2 nm to about 5 nm and ranges therebetween." +US-11631617-B2,10,"A fin field-effect transistor (FINFET), comprising: fins patterned in a substrate, wherein the fins comprise at least one first fin corresponding to a first FINFET device and at least one second fin corresponding to a second FINFET device, wherein the first FINFET device comprises an n-channel FET (NFET), and wherein the second FINFET device comprises a p-channel FET (PFET); a conformal gate dielectric disposed over the fins; at least one first workfunction-setting metal disposed over the at least one first fin and at least one second workfunction-setting metal disposed over the at least one second fin, wherein the at least one first workfunction-setting metal and the at least one second workfunction-setting metal have the same thickness T; dielectric gates formed over the at least one first workfunction-setting metal, the at least one second workfunction-setting metal and the conformal gate dielectric forming gate stacks of the first FINFET device and the second FINFET device; and source and drains formed in the fins between the gate stacks, 
wherein the source and drains are separated from the gate stacks by inner spacers." +US-11631617-B2,11,"The FINFET of claim 11 , wherein the at least one first workfunction-setting metal is selected from the group consisting of: titanium nitride (TiN), tantalum nitride (TaN), titanium aluminide (TiAl), titanium aluminum nitride (TiAlN), titanium aluminum carbide (TiAlC), tantalum aluminide (TaAl), tantalum aluminum nitride (TaAlN), tantalum aluminum carbide (TaAlC), and combinations thereof." +US-11631617-B2,12,"The FINFET of claim 11 , wherein the at least one second workfunction-setting metal is selected from the group consisting of: TiN, TaN, tungsten (W), and combinations thereof." +US-11631617-B2,13,"The FINFET of claim 11 , further comprising: shallow trench isolation STI regions in the substrate in between the fins." +US-11631617-B2,14,"The FINFET of claim 11 , wherein the conformal gate dielectric comprises a high-κ gate dielectric selected from the group consisting of: hafnium oxide (HfO 2 ), lanthanum oxide (La 2 O 3 ), and combinations thereof." +US-11631617-B2,15,"The FINFET of claim 11 , wherein the dielectric gates comprise a material selected from the group consisting of: silicon oxide (SiOx), silicon carbide (SiC), silicon oxycarbide (SiOC), and combinations thereof." +US-11631617-B2,16,"The FINFET of claim 11 , wherein the inner spacers comprise a material selected from the group consisting of: SiOx, silicon oxycarbide (SiOC), silicon nitride (SiN), silicon oxynitride (SiON), silicon carbide nitride (SiCN), silicon oxycarbon nitride (SiOCN), and combinations thereof." +US-11631617-B2,17,"The FINFET of claim 11 , wherein the same thickness T is from about 3 nm to about 10 nm." +US-11631617-B2,18,"The FINFET of claim 11 , wherein the source and drains are disposed on bottom isolation regions." 
+US-11631617-B2,19,"The FINFET of claim 11 , wherein the bottom isolation regions comprise silicon oxide (SiOx) having a thickness t of from about 2 nm to about 5 nm and ranges therebetween." +US-2023104504-A1,0,1 . A cooling device comprising: a vacuum container accommodating an object to be cooled; a refrigerator port provided in the vacuum container and including a port space in which a cold head of a refrigerator configured to cool the object to be cooled is accommodated in a replaceable manner; and a pressure adjustment facility configured to supply gas to the port space to increase a pressure in the port space before the cold head is pulled out. +US-2023104504-A1,1,"The cooling device according to claim 1 , wherein the pressure adjustment facility includes: a pipe including a gas flow path communicating with the port space and drawn out from the refrigerator port; and a valve provided in the pipe, and configured to close the gas flow path during operation of the refrigerator and allow supply of the gas to the gas flow path during replacement of the cold head." +US-2023104504-A1,2,"The cooling device according to claim 2 , wherein the refrigerator port includes a bellows configured to expand and contract in a port central axis direction, and the pipe is drawn out from a room temperature side of the bellows in the refrigerator port." +US-2023104504-A1,3,"The cooling device according to claim 1 , wherein the refrigerator port includes: a sleeve surrounding the port space; and a pedestal that is a member provided on a cooling-side end portion of the sleeve, and is directly or indirectly connected to a stage of the cold head, and the cooling device further comprising: an elastic mechanism configured to apply an elastic force to the pedestal so as to increase a connecting force between the pedestal and the stage." 
+US-2023104504-A1,4,"The cooling device according to claim 1 , wherein the port space includes a first port space and a second port space arranged in a port central axis direction, the refrigerator port includes: a first sleeve surrounding the first port space; a second sleeve surrounding the second port space; a first pedestal that is a member provided on a cooling-side end portion of the first sleeve, and is directly or indirectly connected to a first stage of the cold head; and a second pedestal that is provided on a cooling-side end portion of the second sleeve, and is directly or indirectly connected to a second stage of the cold head, and the cooling device further comprising: a first elastic mechanism configured to apply an elastic force to the first pedestal so as to increase a connecting force between the first pedestal and the first stage; and a second elastic mechanism configured to apply an elastic force to the second pedestal so as to increase a connecting force between the second pedestal and the second stage." +US-2023104504-A1,5,"The cooling device according to claim 5 , wherein the first elastic mechanism includes a plurality of first support elements provided around the first sleeve, each of the first support elements includes an elastic member, the second elastic mechanism includes a plurality of second support elements provided around the first sleeve and the second sleeve, and each of the second support elements includes an elastic member." +US-2023104504-A1,6,"The cooling device according to claim 1 , wherein a heater configured to prevent liquefaction of the gas supplied to the port space is provided in the refrigerator port." 
+US-2023104504-A1,7,"A cold head replacement method comprising: in a state where a cold head of a refrigerator is disposed in a refrigerator port provided in a vacuum container, supplying gas from an outside to a port space in the refrigerator port and thereby increasing a pressure in the port space; and pulling out the cold head from the refrigerator port after the pressure in the port space is increased." +US-2023104504-A1,8,"The cold head replacement method according to claim 8 , further comprising: discharging the gas in the port space to the outside after a new cold head is disposed in the refrigerator port." +US-2022171629-A1,0,1 - 18 . (canceled) +US-2022171629-A1,1,"A multi-thread processor comprising: a plurality of sequential processing stages, each processing stage receiving computational inputs, forming computational results and context, and forwarding the computational results and context to a subsequent stage; a thread map register having a programmable sequence of thread_id entries, the thread map register providing a programmable sequence of thread_ids in a canonical manner, the thread map register providing a subsequent thread_id of the programmable sequence in response to a request; a plurality of program counters, each program counter associated with a thread_id; a plurality of register files, each register file associated with a thread_id; at least one of the sequential processing stages being a pre-fetch stage coupled to an instruction memory, the pre-fetch stage requesting an instruction according to one of the plurality of program counters which is selected according to a thread_id requested by the pre-fetch stage; the prefetch stage retrieving an instruction associated with a program counter associated with a thread_id; at least one of the sequential processing stages being a decode/execute stage operative to modify a register file, the decode/execute stage coupled to a register file associated with the particular thread_id; where at least 
two thread_id values are associated with unique interrupt inputs for each thread_id value, each of the unique interrupt inputs causing a change in execution of only the associated thread_id value, and not other thread_id values." +US-2022171629-A1,2,"The multi-thread processor of claim 19 where the plurality of sequential processing stages comprise, in sequence: the prefetch stage, a fetch stage, a decode stage, the decode-execute stage, an instruction execute stage, a load-store stage, and a writeback stage coupled to the decode-execute stage." +US-2022171629-A1,3,The multi-thread processor of claim 20 where the load-store stage and instruction execute stage couple computational results to the decode-execute stage. +US-2022171629-A1,4,The multi-thread processor of claim 19 where a number of thread map register entries in a canonical sequence is greater than a number of unique thread_id values. +US-2022171629-A1,5,The multi-thread processor of claim 19 where at least one of the sequential processing stages is a load-store coupled to an external memory. +US-2022171629-A1,6,The multi-thread processor of claim 23 where the external memory is subject to a stall condition and a thread_id value associated with operations to the external memory are positioned in non-sequential locations in the thread map register. 
+US-2022171629-A1,7,"A process for a multi-thread processor, the multi-thread processor comprising: a thread map register containing a plurality of thread_ids; a program counter array; a pipeline stage comprising, in sequence: a prefetch stage operative to retrieve an instruction from instruction memory according to a program counter associated with the thread_id, an instruction fetch stage, an instruction decode stage, an instruction decode-execute stage, an instruction execute stage, a load-store stage, and a writeback stage; an external interface coupled to the load-store stage; the process comprising: the thread map register asserting a canonical linear sequence of thread_id values to the pre-fetch stage; the pre-fetch stage retrieving a program counter value from the program counter array associated with a thread_id from the thread map register, the pre-fetch stage providing an instruction associated with the program counter value to the instruction fetch stage; the instruction decode-execute stage or the instruction execute stage generating a computational result; the writeback stage, the load-store stage, and the execute stage receiving at least one computational result and thereafter delivering the computational result back to a decode-execute stage; the decode-execute stage thereafter coupling the computational result to a register file associated with the thread_id; and where a particular thread_id value in the thread map register associated with a thread stall interval is separated from other particular thread_id values by a number of thread_id register positions corresponding to a time interval which is greater than the stall interval of the particular thread_id." +US-2022171629-A1,8,The process of claim 25 where at least one of the thread_id values of the canonical sequence of thread_id values in the thread map register is not adjacent to a same thread_id value in the canonical sequence of thread_id values. 
+US-2022171629-A1,9,The process of claim 25 where thread_id entries in the thread map register are dynamically changed to assign a greater or lesser number of particular thread_id values during the canonical cycle of the linear array of thread map register values. +US-2022171629-A1,10,"The process of claim 25 where each thread_id is associated with a particular interrupt input, the particular interrupt input, when asserted, causing instructions associated with a thread interrupt routine to be executed until the interrupt routine is completed." +US-2022171629-A1,11,The process of claim 28 where instructions associated with threads which do not have an interrupt input asserted continue to execute while a thread associated with the particular interrupt input which is asserted executes a thread interrupt routine. +US-2022171629-A1,12,The process of claim 25 where the load-store stage is coupled to an external interface. +US-2022171629-A1,13,"The process of claim 30 where the external interface is at least one of a Serial Peripheral Interface (SPI) interface, a Peripheral Component Interconnect (PCI) interface, or an interface which includes delivery of an address and data to be read or written." 
+US-2022171629-A1,14,"A process for a multi-thread processor providing granularity in allocation of thread assignment, the multi-thread processor operative to execute instructions for a plurality of independent threads, the multi-thread processor comprising: a plurality of pipeline stages including a pre-fetch stage requesting an instruction according to a thread_id; a thread map register having a sequence of thread_id values which are programmable, each of the plurality of independent threads associated with a particular thread_id, the thread map register being programmable to output thread_id values in a programmable order, each particular thread_id associated with one or more locations in the sequence of thread_id values; a plurality of program counters, each program counter associated with a particular one of the independent threads and associated thread_id; the prefetch stage requesting an instruction according to a thread_id causing the prefetch stage to receive a current thread_id value from the sequence of thread_id values from the thread map register; the prefetch stage thereafter requesting an instruction from an instruction memory using a program counter associated with a current thread_id value; each pipeline stage of the plurality of pipeline stages performing operations on the instruction requested by the prefetch stage; at least one pipeline stage coupled to an external interface, the external interface being associated with at least one thread having a thread stall interval; and where each thread_id value in the thread map register which is associated with a thread having a thread stall interval is separated from other thread_id values by a number of thread map register locations corresponding to a time interval which is greater than the thread stall interval." 
+US-2022171629-A1,15,"The process of claim 31 where the series of pipeline stages comprises the pre-fetch stage coupled, in sequence, to a decode stage, a decode-execute stage, a load-store stage, and a writeback stage coupled to the decode-execute stage, each of the pipelined stages sending a result and a thread_id to a subsequent stage." +US-2022171629-A1,16,"The process of claim 33 where the decode-execute stage includes a plurality of register files, each register file selected according to the thread_id received by the decode-execute stage." +US-2022201148-A1,0,"1 . A method for allowing a user to select and send multiple scanned documents to multiple different destinations, the method comprising: at a multi-function device: receiving multiple scan jobs separated using a pre-defined separator, wherein each scan job comprising a document having one or more pages, wherein the pre-defined separator comprises a blank page or a page including a pre-defined image; scanning multiple scan jobs to generate multiple scanned documents, wherein each scanned document corresponds to a single scan job; providing a user interface to a user displaying each scanned document and corresponding multiple different destinations for selection; and based on the user selection, sending each scanned document to the multiple selected destinations in a single submission." +US-2022201148-A1,1,"The method of claim 1 , further comprising processing multiple scanned documents to segregate into different scanned documents based on the pre-defined separator." +US-2022201148-A1,2,"The method of claim 1 , further comprising providing the user interface displaying a preview of each scanned document." +US-2022201148-A1,3,"The method of claim 1 , further comprising receiving a selection of the multiple destinations from the user for each scanned document." +US-2022201148-A1,4,"The method of claim 1 , further comprising storing multiple scanned documents in a pre-defined memory." 
+US-2022201148-A1,5,"The method of claim 5 , further comprising retrieving each scanned document from the pre-defined memory for display via the user interface." +US-2022201148-A1,6,"The method of claim 1 , wherein the multiple destinations comprise at least: print, email, USB, SMB, SFTP, FTP, OneDrive, DropBox™, cloud server, and Email," +US-2022201148-A1,7,"The method of claim 1 , further comprising automatically deleting each scanned document from a pre-defined memory after sending each scanned document to respective selected destinations." +US-2022201148-A1,8,"The method of claim 1 , further comprising simultaneously sending multiple scanned documents to the respective multiple selected destinations," +US-2022201148-A1,9,"The method of claim 1 , further comprising allowing the user to select multiple different destinations for each scanned document." +US-2022201148-A1,10,"The method of claim 1 , further comprising allowing the user to select a single but different destination for each scanned document." +US-2022201148-A1,11,"The method of claim 1 , further comprising mapping each scanned document to the user selected destinations." +US-2022201148-A1,12,"A multi-function device, comprising: a duplex automatic document handler (DADH) for receiving multiple scan jobs separated using a pre-defined separator, each scan job comprising a document having one or more pages, wherein the pre-defined separator comprises a blank page or a page including a pre-defined image; a scanner for scanning multiple scan jobs to generate multiple scanned documents, wherein each scanned document corresponds to a single scan job; a user interface for displaying each scanned document and corresponding multiple different destinations for selection by a user; and a network controller for sending each scanned document to the multiple selected destinations in a single submission, based on the user selection." 
+US-2022201148-A1,13,"The multi-function device of claim 13 , wherein the network controller is for processing multiple scanned documents to segregate into different scanned documents based on the pre-defined separator." +US-2022201148-A1,14,"The multi-function device of claim 13 , wherein the user interface is for displaying a preview of each scanned document." +US-2022201148-A1,15,"The multi-function device of claim 13 , wherein the network controller is for receiving a selection of the multiple destinations from the user for each scanned document." +US-2022201148-A1,16,"The multi-function device of claim 13 , wherein sending each scanned document comprises mapping each scanned document to multiple destinations in parallel." +US-2022201148-A1,17,"The multi-function device of claim 13 , wherein sending each scanned document comprises sending each scanned document to the multiple selected destinations in parallel." +US-2022201148-A1,18,"The multi-function device of claim 13 , further comprising a pre-defined memory to store the multiple scanned documents." +US-2022201148-A1,19,"A non-transitory computer-readable medium comprising instructions executable by a processing resource to: at an application accessible at a multi-function device: receive multiple scan jobs separated using a pre-defined separator, wherein each scan job comprising a document having one or more pages, wherein the pre-defined separator comprises a blank page or a page including a pre-defined image; scan multiple scan jobs to generate multiple scanned documents, wherein each scanned document corresponds to a single scan job; provide a user interface to a user to display each scanned document and corresponding multiple different destinations for selection; and based on the user selection, send each scanned document to user's selected destinations in a single submission." 
diff --git a/argilla/docs/community/token_classification_tutorial.ipynb b/argilla/docs/community/token_classification_tutorial.ipynb new file mode 100644 index 0000000000..413b8be7e2 --- /dev/null +++ b/argilla/docs/community/token_classification_tutorial.ipynb @@ -0,0 +1,2025 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VHK4-Xf-cgfI" + }, + "source": [ + "# Fine-tuning a token classification model using a custom Argilla dataset and Hugging Face AutoTrain\n", + "\n", + "We all want to try solving a real use case with the neat tools and techniques available out there.\n", + "In this tutorial, I go over my learning journey of fine-tuning a model on US patent text." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VLwsHGv1H7qn" + }, + "source": [ + "## 1. Introduction\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9PbUPzbHIDlA" + }, + "source": [ + "### 1.1 Background on Named Entity Recognition (NER)\n", + "\n", + "Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc.\n", + "\n", + "### 1.2 Importance of NER in Natural Language Processing\n", + "\n", + "NER plays a crucial role in various NLP applications, including:\n", + "- Information Retrieval\n", + "- Question Answering Systems\n", + "- Machine Translation\n", + "- Text Summarization\n", + "- Sentiment Analysis\n", + "\n", + "### 1.3 Challenges in NER for Specific Domains or Languages\n", + "\n", + "While general-purpose NER models exist, they often fall short when applied to specialized domains or less-common languages due to:\n", + "- Domain-specific terminology\n", + "- Unique entity 
types\n", + "- Language-specific nuances\n", + "\n", + "### 1.4 The Need for Custom, Fine-tuned Models\n", + "\n", + "To address these challenges, fine-tuning custom NER models becomes essential. This approach allows for:\n", + "- Adaptation to specific domains: A fine-tuned model can perform better on specific tasks or domains than a general-purpose model.\n", + "- Efficiency: Fine-tuned models often require less data and computational resources to achieve good performance on specific tasks.\n", + "- Faster inference: Smaller, task-specific models run faster than larger, general-purpose ones.\n", + "\n", + "### 1.5 Project Objectives and Overview\n", + "\n", + "In this project, we aim to fine-tune a custom NER model for USPTO patents. Our objectives include:\n", + "- Use [Hugging Face Spaces](https://huggingface.co/spaces) to set up an instance of [Argilla](https://argilla.io/).\n", + "- Use the [Argilla](https://argilla.io/) UI to annotate our dataset with custom labels.\n", + "- Use Hugging Face [AutoTrain](https://huggingface.co/autotrain) to create a more efficient model in terms of size and inference speed.\n", + "- Demonstrate the effectiveness of transfer learning in NER tasks.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gv3ctU0RcgcZ" + }, + "source": [ + "## 2. Data Background\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cwomBlwA601E" + }, + "source": [ + "US patent texts are typically long, descriptive documents about inventions. The data used in this tutorial can be accessed through the [Kaggle USPTO Competition](https://www.kaggle.com/competitions/uspto-explainable-ai). Each patent contains several fields:\n", + "- Title\n", + "- Abstract\n", + "- Claims\n", + "- Description\n", + "\n", + "For this tutorial, we'll focus on the `claims` field." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "txIoyMIBcgWq", + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "### 2.1 Problem Statement\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yY--CceA64x9" + }, + "source": [ + " Our goal is to fine-tune a model to classify tokens in the `claims` field of a given patent." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aWU2pWV1cgT5", + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "### 2.2 Breaking Down the Problem\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0DAXEuBh673-" + }, + "source": [ + "To achieve this goal, we need:\n", + "\n", + "1. High-quality data to fine-tune a pretrained token classification model\n", + "2. Infrastructure to execute the training" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZlM-G2oWcgOQ" + }, + "source": [ + "## 3. Create High-Quality Data with Argilla\n", + " [Argilla](https://github.com/argilla-io/argilla/) is an excellent tool for creating high-quality datasets with a user-friendly interface for labeling.\n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IpOvBlJgAvOz", + "jp-MarkdownHeadingCollapsed": true + }, + "source": [ + "### 3.1 Setting Up Argilla on Hugging Face Spaces" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CSJdeDlZ7MUm" + }, + "source": [ + "#### 1. Visit [Hugging Face Spaces deployment page](https://huggingface.co/new-space?template=argilla/argilla-template-space)\n", + "\n", + "#### 2. Create a new space:\n", + " - Provide a name\n", + " - Select `Docker` as Space SDK\n", + " - Choose `Argilla` as Docker Template\n", + " - Leave other fields empty for simplicity\n", + " - Click on `Create Space`\n", + "\n", + "#### 3. Restart the Space\n", + "\n", + "Now you have an Argilla instance running on Hugging Face Spaces. 
Click on the space you created to go to the login screen of Argilla UI.\n", + "Access the UI using the credentials:\n", + "\n", + "- Username: `admin`\n", + "- Password: `12345678` [default password]\n", + "\n", + "For more options and setting up the Argilla instance for production use-cases, please refer to [Configure Argilla on Huggingface](https://docs.argilla.io/dev/getting_started/how-to-configure-argilla-on-huggingface/)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NmgWfKkMcgLS" + }, + "source": [ + "### 3.2 Create a Dataset with Argilla Python SDK" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AERFvwo2cgIg" + }, + "source": [ + "#### Step 1: Install & Import packages" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "id": "FGEVJCVXf03g" + }, + "outputs": [], + "source": [ + "!pip install -U datasets argilla autotrain-advanced==0.8.8 > install_logs.txt 2>&1" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "u8NzXaP4AvO2" + }, + "outputs": [], + "source": [ + "import argilla as rg\n", + "import pandas as pd\n", + "import re\n", + "import os\n", + "import random\n", + "import torch\n", + "from IPython.display import Image, display,HTML\n", + "from datasets import load_dataset, Dataset, DatasetDict,ClassLabel,Sequence,Value,Features\n", + "from transformers import pipeline,TokenClassificationPipeline\n", + "from typing import List, Dict, Union,Tuple\n", + "from google.colab import userdata" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VrFpQB6NcgCM" + }, + "source": [ + "#### Step 2: Initialize the Argilla Client\n", + "api_url: We can get this URL by using the `https://huggingface.co/spaces//?embed=true`\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "zoj5Fdp0f5Oe" + }, + "outputs": [], + "source": [ + "client = rg.Argilla(\n", + " api_url=\"https://bikashpatra-argilla-uspto-labelling.hf.space\",\n", 
+ " #api_url=\"https://-.hf.space # This is url to my public space.\n", + " api_key=\"admin.apikey\", # default value. Shouldn't be used for production.\n", + " # headers={\"Authorization\": f\"Bearer {HF_TOKEN}\"}\n", + ")\n", + "#Replace `` and `` with your actual Hugging Face username and space name." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oiOom7EZgGvN" + }, + "source": [ + "#### Step 3: Configure the Dataset\n", + "To configure an Argilla dataset for a token classification task, we need to:\n", + "\n", + "1. Come up with labels specific to our problem domain. I came up with some labels by using the following prompt:\n", + " >suggest me some labels like \"Process\", \"Product\", \"Composition of Matter\" which can be used to annotate tokens in the claims or description section of patents filed in US\n", + "\n", + "2. Configure the fields (columns) of our dataset and its [`questions`](https://docs.argilla.io/latest/how_to_guides/dataset/#questions). The `questions` parameter lets you instruct and guide the annotator on the task. In our use case, annotators select from the `labels` we created when annotating pieces (tokens) of text.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "RJVJI2HKgMpE" + }, + "outputs": [], + "source": [ + "# Labels for token classification\n", + "labels = [\n", + " \"Process\", \"Product\", \"Composition of Matter\", \"Method of Use\",\n", + " \"Software\", \"Hardware\", \"Algorithm\", \"System\", \"Device\",\n", + " \"Apparatus\", \"Method\", \"Machine\", \"Manufacture\", \"Design\",\n", + " \"Pharmaceutical Formulation\", \"Biotechnology\", \"Chemical Compound\",\n", + " \"Electrical Circuit\"\n", + "]\n", + "\n", + "# Dataset settings\n", + "settings = rg.Settings(\n", + " guidelines=\"Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.\",\n", + " fields=[\n", + " 
rg.TextField(name=\"tokens\", title=\"Text\", use_markdown=True),\n", + " rg.TextField(name=\"document_id\", title=\"publication_number\", use_markdown=True),\n", + " rg.TextField(name=\"sentence_id\", title=\"sentence_id\", use_markdown=False)\n", + " ],\n", + " questions=[\n", + " rg.SpanQuestion(\n", + " name=\"span_label\",\n", + " field=\"tokens\",\n", + " labels=labels,\n", + " title=\"Classify the tokens according to the specified categories.\",\n", + " allow_overlapping=True\n", + " )\n", + " ]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DP63-Heccf5B" + }, + "source": [ + "#### Step 4: Create the dataset on the Argilla instance\n", + "With the settings in place, we are ready to create our dataset using the [`rg.Dataset`](https://docs.argilla.io/latest/how_to_guides/dataset/#create-a-dataset) API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ldsAzvcxgU3N", + "outputId": "f5a2f510-9453-48ce-9386-4b692f8521f2" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/usr/local/lib/python3.10/dist-packages/argilla/datasets/_resource.py:202: UserWarning: Workspace not provided. Using default workspace: admin id: fd4fc24c-fc1f-4ffe-af41-d569432d6b50\n", + " warnings.warn(f\"Workspace not provided. Using default workspace: {workspace.name} id: {workspace.id}\")\n" + ] + }, + { + "data": { + "text/plain": [ + "Dataset(id=UUID('a187cdad-175e-4d87-989f-a529b9999bde') inserted_at=datetime.datetime(2024, 7, 28, 7, 23, 59, 902685) updated_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) name='claim_tokens' status='ready' guidelines='Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.' 
allow_extra_metadata=False workspace_id=UUID('fd4fc24c-fc1f-4ffe-af41-d569432d6b50') last_activity_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) url=None)" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# We name the dataset claim_tokens\n", + "rg_dataset = rg.Dataset(\n", + " name=\"claim_tokens\",\n", + " settings=settings,\n", + ")\n", + "rg_dataset.create()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8CDUB2PRgXOT" + }, + "source": [ + " After Step 4, we should see the dataset in the Argilla UI. We can verify this by logging in to the Argilla UI (`https://huggingface.co/spaces/-.hf.space`) with the default credentials.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h15PQetthT3l" + }, + "source": [ + "We can inspect the settings of the dataset by clicking on the settings icon next to the dataset name.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 214 + }, + "id": "g2DIaP3nTC1e", + "outputId": "a1c575ae-4ef1-4099-d851-4b035b5bea2f" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "def display_image(filename): display(Image(filename=filename))\n", + "\n", + "display_image('/content/images/argilla_ds_list_settings.png')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E4GL9IFQhbaT" + }, + "source": [ + " The Fields tab of the settings screen lists the fields we configured while creating the dataset with the Python SDK." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 659 + }, + "id": "X7IRX-EJTlaJ", + "outputId": "677b46b8-ff8f-453f-f689-fa66a7d7e137" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_image('/content/images/argilla_ds_settings.png')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tHYz9mXXhoDz" + }, + "source": [ + "#### Step 5: Insert records to the Argilla datasets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSuXSaTahxfa" + }, + "source": [ + "Data preparation notebook can be found [here](https://www.kaggle.com/code/boredmgr/claim-sampling)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 112 + }, + "id": "yXoI-V-wl6uu", + "outputId": "e75a8fa3-0e18-4673-b006-be4fd74270ba" + }, + "outputs": [ + { + "data": { + "application/vnd.google.colaboratory.intrinsic+json": { + "summary": "{\n \"name\": \"claims\",\n \"rows\": 149,\n \"fields\": [\n {\n \"column\": \"publication_number\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 10,\n \"samples\": [\n \"US-2022171629-A1\",\n \"US-4135965-A\",\n \"US-5451986-A\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"sequence_id\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 5,\n \"min\": 0,\n \"max\": 20,\n \"num_unique_values\": 21,\n \"samples\": [\n 0,\n 17,\n 15\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"tokens\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 149,\n \"samples\": [\n \"The image formation device of claim 4 wherein said backup roller comprises acrylic rubber.\",\n \"The method of claim 1 further comprising: for each attack in 
a set of attacks: applying the attack to a benchmark IP design; and after applying the attack to the benchmark IP design: determining whether an extracted key from applying the attack matches at least a portion of an original key used in locking the benchmark IP design; responsive to the extracted key matching at least the portion of the original key, placing the attack into a first attack group; and responsive to the extracted key not matching at least the portion of the original key, placing the attack into a second attack group; applying each attack in the first attack group to the benchmark IP design sequentially, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the first attack group is applied to the benchmark IP design; applying each attack in the second attack group to the benchmark IP design in parallel after applying each attack in the first attack group, wherein a de-obfuscation is performed on the benchmark IP design after each attack in the second attack group is applied to the benchmark IP design to produce a de-obfuscated IP design for a set of de-obfuscated IP designs; and generating the plurality of attacks comprising the attacks found in the first attack group and the attack found in the second attack group that produces the de-obfuscated IP design in the set of de-obfuscated IP designs having a highest number of extracted key-bits.\",\n \"The multi-thread processor of claim 19 where at least one of the sequential processing stages is a load-store coupled to an external memory.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}", + "type": "dataframe", + "variable_name": "claims" + }, + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
publication_numbersequence_idtokens
0US-4444749-A0A shampoo comprising an aqueous solution of an...
1US-4444749-A1A shampoo comprising an aqueous solution of an...
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "\n", + "
\n", + "
\n" + ], + "text/plain": [ + " publication_number sequence_id \\\n", + "0 US-4444749-A 0 \n", + "1 US-4444749-A 1 \n", + "\n", + " tokens \n", + "0 A shampoo comprising an aqueous solution of an... \n", + "1 A shampoo comprising an aqueous solution of an... " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "claims = pd.read_csv(\"/content/sample_publications.csv\")\n", + "claims.head(2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WnIRb65YBUxi" + }, + "source": [ + "Here we are reading rows of the csv and mapping them to the fields we created during Argilla dataset configuration step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 146 + }, + "id": "X_lLJyy4hq55", + "outputId": "e797cab1-8f41-47ff-c05d-0b74cc04d3d9" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
DatasetRecords: The provided batch size 256 was normalized. Using value 149.\n",
+       "
\n" + ], + "text/plain": [ + "DatasetRecords: The provided batch size \u001b[1;36m256\u001b[0m was normalized. Using value \u001b[1;36m149\u001b[0m.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sending records...: 100%|██████████| 1/1 [00:00<00:00, 1.71batch/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "DatasetRecords(Dataset(id=UUID('a187cdad-175e-4d87-989f-a529b9999bde') inserted_at=datetime.datetime(2024, 7, 28, 7, 23, 59, 902685) updated_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) name='claim_tokens' status='ready' guidelines='Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.' allow_extra_metadata=False workspace_id=UUID('fd4fc24c-fc1f-4ffe-af41-d569432d6b50') last_activity_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) url=None))" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "## We upload a CSV with three columns: tokens, publication_number, sequence_id\n", + "\n", + "publication_df = pd.read_csv(\"/content/sample_publications.csv\")\n", + "## Convert dataframe rows to Argilla Records\n", + "records = [\n", + " rg.Record(\n", + " fields=\n", + " {\"tokens\": \"\".join(row[\"tokens\"])\n", + " ,'document_id':str(row['publication_number'])\n", + " ,'sentence_id':str(row['sequence_id'])\n", + " })\n", + " for _,row in publication_df.iterrows()\n", + " ]\n", + " ## Store the Argilla records in the Argilla dataset\n", + "rg_dataset.records.log(records)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IYpb2F8SvyHL" + }, + "source": [ + "Once the records are pushed to the Argilla dataset, the UI renders the records and the labels so the annotator can annotate the text.\n", + "\n", + "See the screenshot below." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 378 + }, + "id": "rvgq_GcCUATV", + "outputId": "05658172-fd94-4f82-dadb-58d1d4fc2e8a" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_image(\"/content/images/annotation_screen.png\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hjffnMlJwVCZ" + }, + "source": [ + "#### Step 6 : Annotate tokens in every records with appropriate labels.\n", + "Login to the Argilla UI and start annotating.\n", + "\n", + "Argilla UI : `https://huggingface.co/spaces//`\n", + "\n", + "username : `admin`\n", + "\n", + "password : `12345678`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hS5j4Ibcxwyg" + }, + "source": [ + "\n", + "\n", + "> After annotating the data , we will have to convert Argilla Dataset to HuggingFace dataset in order to use HuggingFace AutoTrain for fine-tuning the model. HF AutoTrain allows training on CSV data too which can be uploaded from AutoTrain UI. But for this tutorial we will use Huggingface dataset.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gNCS-BwJx_XM" + }, + "source": [ + "## 4. 
Argilla Dataset to HuggingFace Dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KJ03VaCYzAih" + }, + "source": [ + "#### Step 1: Load our annotated dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "id": "qfQGSsrJhmk-" + }, + "outputs": [], + "source": [ + "rg_dataset = client.datasets(\"claim_tokens\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fX8sT2LYzKix" + }, + "source": [ + "#### Step 2 : Filter the rows / records which are annotated.\n", + "To iterate quickly on annotation and training, we should be able to annotate a few records and train our model on just those. We can achieve this with the [query/filter](https://docs.argilla.io/latest/how_to_guides/query/) functionality of the Argilla Dataset.\n", + "\n", + "Using the [`rg.Query()`](https://docs.argilla.io/latest/how_to_guides/query/) API, we can select only the annotated records to prepare our training dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UE8ZSgZ4GtWf", + "outputId": "40f8fef6-b38b-436f-d1d2-51d4ae2cb34d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{'id': '01e9b4bb-9c98-4cec-acea-dd686cddf5f0',\n", + " 'status': 'pending',\n", + " '_server_id': '0b6f16f3-c3dc-4947-ac77-8b65002bf350',\n", + " 'tokens': 'The FINFET of claim 11 , wherein the conformal gate dielectric comprises a high-κ gate dielectric selected from the group consisting of: hafnium oxide (HfO 2 ), lanthanum oxide (La 2 O 3 ), and combinations thereof.',\n", + " 'document_id': 'US-11631617-B2',\n", + " 'sentence_id': '14',\n", + " 'span_label.responses': [[{'label': 'Electrical Circuit',\n", + " 'start': 4,\n", + " 'end': 10},\n", + " {'label': 'Chemical Compound', 'start': 138, 'end': 151},\n", + " {'label': 'Chemical Compound', 'start': 162, 'end': 177}]],\n", + " 'span_label.responses.users': 
['4e9588d6-e2d6-450d-82c6-b33324d94708'],\n", + " 'span_label.responses.status': ['submitted']}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "status_filter = rg.Query(filter=rg.Filter((\"response.status\", \"==\", \"submitted\")))\n", + "\n", + "submitted = rg_dataset.records(status_filter).to_list(flatten=True)\n", + "submitted[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rMNJN9LPzXfW" + }, + "source": [ + "The annotated dataset cannot be fed as is to the model for fine-tuning. For token-classification task, we will have to make our data that adheres to the structure as described below.\n", + "- Dataset Structure: The dataset should typically have two main columns:\n", + " - `tokens`: A list of words/tokens for each example.\n", + " - `ner_tags`: A list of corresponding labels for each token. The labels must follow the [IOB labelling scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).\n", + "- Label Encoding: The labels should be integers, with each integer corresponding to a specific named entity tag.\n", + "Below functions will allow us to convert our Argilla dataset to the required dataset structure.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "id": "vBTcMXPNHBQb" + }, + "outputs": [], + "source": [ + "def get_iob_tag_for_token(token_start:int, token_end:int, ner_spans:List[Dict[str, Union[int, str]]]) -> str:\n", + " \"\"\"\n", + " Determine the IOB tag for a given token based on its position within NER spans.\n", + "\n", + " Args:\n", + " token_start (int): The start index of the token in the text.\n", + " token_end (int): The end index of the token in the text.\n", + " ner_spans (List[Dict[str, Union[int, str]]]): A list of dictionaries containing NER span information.\n", + " Each dictionary should have 'start', 'end', and 'label' keys.\n", + "\n", + " Returns:\n", + " str: The IOB tag for 
the token. 'B-' prefix for the beginning of an entity,\n", + " 'I-' for inside an entity, or 'O' for outside any entity.\n", + " \"\"\"\n", + " for span in ner_spans:\n", + " if token_start >= span[\"start\"] and token_end <= span[\"end\"]:\n", + " if token_start == span[\"start\"]:\n", + " return f\"B-{span['label']}\"\n", + " else:\n", + " return f\"I-{span['label']}\"\n", + " return \"O\"\n", + "\n", + "\n", + "def extract_ner_tags(text:str, responses:List[Dict[str, Union[int, str]]]):\n", + " \"\"\"\n", + " Extract NER tags for tokens in the given text based on the provided NER responses.\n", + "\n", + " Args:\n", + " text (str): The input text to be tokenized and tagged.\n", + " responses (List[Dict[str, Union[int, str]]]): A list of dictionaries containing NER span information.\n", + " Each dictionary should have 'start', 'end', and 'label' keys.\n", + "\n", + " Returns:\n", + " List[str]: A list of IOB tags corresponding to each non-whitespace token in the text.\n", + " \"\"\"\n", + " tokens = re.split(r\"(\\s+)\", text)\n", + " ner_tags = []\n", + " current_position = 0\n", + " for token in tokens:\n", + " if token.strip():\n", + " token_start = current_position\n", + " token_end = current_position + len(token)\n", + " tag = get_iob_tag_for_token(token_start, token_end, responses)\n", + " ner_tags.append(tag)\n", + " current_position += len(token)\n", + " return ner_tags" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o40bEmkh1hRH" + }, + "source": [ + "#### Step 3: Get tokens and their respective annotations" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "id": "-zUyFXuxHJbW" + }, + "outputs": [], + "source": [ + "def get_tokens_ner_tags(annotated_rows) -> Tuple[List[List[str]], List[List[str]]]:\n", + " \"\"\"\n", + " Extract tokens and their corresponding NER tags from annotated rows.\n", + "\n", + " This function processes a list of annotated rows, where each row contains\n", + " tokens and span labels. 
It splits the tokens and extracts NER tags for each token.\n", + "\n", + " Args:\n", + " annotated_rows (List[Dict[str, Union[str, List[Dict[str, Union[int, str]]]]]]):\n", + " A list of dictionaries, where each dictionary represents an annotated row.\n", + " Each row should have a 'tokens' key (str) and a 'span_label.responses' key\n", + " (List[Dict[str, Union[int, str]]]).\n", + "\n", + " Returns:\n", + " Tuple[List[List[str]], List[List[str]]]: A tuple containing two elements:\n", + " 1. A list of token lists, where each inner list represents tokens for a row.\n", + " 2. A list of NER tag lists, where each inner list represents NER tags for a row.\n", + " \"\"\"\n", + " tokens = []\n", + " ner_tags = []\n", + " for row in annotated_rows:\n", + " tags = extract_ner_tags(row[\"tokens\"], row[\"span_label.responses\"][0])\n", + " tks = row[\"tokens\"].split()\n", + " tokens.append(tks)\n", + " ner_tags.append(tags)\n", + " return tokens, ner_tags\n", + "train_tokens, train_ner_tags = get_tokens_ner_tags(submitted[:1])\n", + "validation_tokens, validation_ner_tags = get_tokens_ner_tags(submitted[1:2])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mrTxFf0514xA" + }, + "source": [ + "##### Vibe Check\n", + "It's always good to check our data after a few operations. This helps us verify, and debug if necessary, that each step produces the desired output." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 409 + }, + "id": "X9l0QnGPHL27", + "outputId": "be94a4c9-7098-4d80-eeb4-dc8da0d42e7b" + }, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "\n" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
Sample Train Tokens:['The', 'FINFET', 'of', 'claim', '11', ',', 'wherein', 'the', 'conformal', 'gate', 'dielectric', 'comprises', 'a', 'high-κ', 'gate', 'dielectric', 'selected', 'from', 'the', 'group', 'consisting', 'of:', 'hafnium', 'oxide', '(HfO', '2', '),', 'lanthanum', 'oxide', '(La', '2', 'O', '3', '),', 'and', 'combinations', 'thereof.']

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
Sample Valid Tokens:['The', 'method', 'of', 'claim', '2', ',', 'wherein', 'generating', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions', 'based', 'at', 'least', 'in', 'part', 'on', 'the', 'set', 'of', 'attack', 'mitigation', 'rules', 'comprises', 'generating', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions', 'by', 'inputting', 'the', 'set', 'of', 'attack', 'mitigation', 'rules', 'to', 'a', 'model', 'configured', 'to', 'perform', 'structural', 'and', 'functional', 'analysis', 'to', 'interpret', 'the', 'set', 'of', 'attack', 'mitigation', 'rules,', 'wherein', 'the', 'set', 'of', 'attack', 'mitigation', 'rules', 'comprises', 'one', 'or', 'more', 'rules', 'used', 'by', 'the', 'model', 'to', 'identify', 'the', 'key-gate', 'type', 'for', 'each', 'possible', 'design', 'modification', 'solution', 'of', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions', 'and', 'one', 'or', 'more', 'rules', 'used', 'by', 'the', 'model', 'to', 'identify', 'the', 'location', 'where', 'to', 'insert', 'the', 'key-gate', 'type', 'for', 'each', 'possible', 'design', 'modification', 'solution', 'of', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions.']

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
Sample Train tags:['O', 'B-Electrical Circuit', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Chemical Compound', 'I-Chemical Compound', 'O', 'O', 'O', 'B-Chemical Compound', 'I-Chemical Compound', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
Sample Valid tags:['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Algorithm', 'I-Algorithm', 'I-Algorithm', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Biotechnology', 'I-Biotechnology', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'I-Process', 'O', 'O']
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display(HTML('''\n", + "\n", + "'''))\n", + "display(HTML(\"
Sample Train Tokens:\" +\n",
+    "             f\"{train_tokens[0]}

\"))\n", + "display(HTML(\"
Sample Valid Tokens:\" +\n",
+    "             f\"{validation_tokens[0]}

\"))\n", + "display(HTML(\"
Sample Train tags:\" +\n",
+    "             f\"{train_ner_tags[0]}

\"))\n", + "display(HTML(\"
Sample Valid tags:\" +\n",
+    "             f\"{validation_ner_tags[0]}
\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OfiwXotz2Dxh" + }, + "source": [ + "As we are trying to have our data creation and model training pipeline working, for simplicity , I have dealing with one record each for training and validation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xvdsvQgh2Fv4" + }, + "source": [ + "#### Step 4: Map labels (tags) to integers" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "wPqptNmRHOcB" + }, + "outputs": [], + "source": [ + "def mapped_ner_tags(ner_tags: List[List[str]]) -> List[List[int]]:\n", + " \"\"\"\n", + " Convert a list of NER tags to their corresponding integer IDs.\n", + " This function takes a list of lists containing string NER tags, creates a unique mapping\n", + " of these tags to integer IDs, and then converts all tags to their respective IDs.\n", + " Args:\n", + " ner_tags (List[List[str]]): A list of lists, where each inner list contains string NER tags.\n", + " Returns:\n", + " List[List[int]]: A list of lists, where each inner list contains integer IDs\n", + " corresponding to the input NER tags.\n", + " Example:\n", + " >>> ner_tags = [['O', 'B-PER', 'I-PER'], ['O', 'B-ORG']]\n", + " >>> mapped_ner_tags(ner_tags)\n", + " [[0, 1, 2], [0, 3]]\n", + " Note:\n", + " The mapping of tags to IDs is created based on the unique tags present in the input.\n", + " The order of ID assignment may vary between function calls if the input changes.\n", + " \"\"\"\n", + " labels = list(set([item for sublist in ner_tags for item in sublist]))\n", + " id2label = {i: label for i, label in enumerate(labels)}\n", + " label2id = {label: id_ for id_, label in id2label.items()}\n", + " mapped_ner_tags = [[label2id[label] for label in ner_tag] for ner_tag in ner_tags]\n", + " return mapped_ner_tags" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "Ge9nS_PpAvPC" + }, + "outputs": [], + "source": [ + "def 
get_labels(ner_tags: List[List[str]]) -> List[str]:\n", + " \"\"\"\n", + " Extract unique labels from a list of NER tag sequences.\n", + " This function takes a list of lists containing NER tags and returns a list of unique labels\n", + " found across all sequences.\n", + "\n", + " Args:\n", + " ner_tags (List[List[str]]): A list of lists, where each inner list contains string NER tags.\n", + " Returns:\n", + " List[str]: A list of unique NER labels found in the input sequences.\n", + " Example:\n", + " >>> ner_tags = [['O', 'B-PER', 'I-PER'], ['O', 'B-ORG', 'I-ORG'], ['O', 'B-PER']]\n", + " >>> get_labels(ner_tags)\n", + " ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG']\n", + " Note:\n", + " The order of labels in the output list is not guaranteed to be consistent\n", + " between function calls, as it depends on the order of iteration over the set.\n", + " \"\"\"\n", + " return list(set([item for sublist in ner_tags for item in sublist]))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KEx_Nb8v2MeY" + }, + "source": [ + "#### Step 5: Argilla Dataset to HuggingFace Dataset\n", + "We now have our data in a structure as required for token classification dataset. We will just have to create a Hugging Face Dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "hH4c2lZ5HWZT" + }, + "outputs": [], + "source": [ + "train_labels = get_labels(train_ner_tags)\n", + "validation_labels = get_labels(validation_ner_tags)\n", + "labels = list(set(train_labels + validation_labels))\n", + "# Build one shared label-to-id mapping from `labels`, so that the integer ids stored in\n", + "# both splits line up with the ClassLabel names declared below.\n", + "label2id = {label: idx for idx, label in enumerate(labels)}\n", + "features = Features({\n", + " \"tokens\": Sequence(Value(\"string\")),\n", + " \"ner_tags\": Sequence(ClassLabel(num_classes=len(labels), names=labels))\n", + "})\n", + "train_records = [\n", + " {\n", + " \"tokens\": token,\n", + " \"ner_tags\": [label2id[tag] for tag in ner_tag],\n", + " }\n", + " for token, ner_tag in zip(train_tokens, train_ner_tags)\n", + "]\n", + "validation_records = [\n", + " {\n", + " \"tokens\": token,\n", + " \"ner_tags\": [label2id[tag] for tag in ner_tag],\n", + " }\n", + " for token, ner_tag in zip(validation_tokens, validation_ner_tags)\n", + "]\n", + "span_dataset = DatasetDict(\n", + " {\n", + " \"train\": Dataset.from_list(train_records,features=features),\n", + " \"validation\": Dataset.from_list(validation_records,features=features),\n", + " }\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "yHfgcVTyHZh1" + }, + "outputs": [], + "source": [ + "# assertion to verify that the train split conforms to the dataset structure required for fine-tuning.\n", + "assert span_dataset['train'].features['ner_tags'].feature.names is not None" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2PKgAnGY2Wxo" + }, + "source": [ + "#### Step 6: Push dataset to Hugging Face Hub" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "emltkVVMHdAm", + "outputId": "b98ab8d6-5516-45b9-8cfa-c86f45e630c5" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + " _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n", + " _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n", + " 
_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n", + " _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n", + " _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n", + "\n", + " To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n", + "Enter your token (input will not be visible): \n", + "Add token as git credential? (Y/n) n\n", + "Token is valid (permission: write).\n", + "Your token has been saved to /root/.cache/huggingface/token\n", + "Login successful\n" + ] + } + ], + "source": [ + "!huggingface-cli login" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 202, + "referenced_widgets": [ + "e4f9cff7519b401a9db04f247b7e473c", + "9b4e205349554a6fa38b50f19645181a", + "9d85ae1441804dfba44528e1bc543a7d", + "17c50ff23aa54b7ebb142c5df5615f78", + "f34a9eb89d804712bf6ee63dbc685017", + "764dfd183d054eaa9ce458041be856ef", + "9e7ada0da57c4f7a9994ae22e05407d5", + "d7aa2c0be22b41c2841047336bef2480", + "ebfe6ea4ec834638a0246cd40cda9ea4", + "d03837e77b97463687a54e7b81feb680", + "70e15d43fccf47399525cbc6578618a2", + "13636a22e24f4b1287bb4234f15f6a32", + "3d6486dae69f4d50b471617044353d20", + "2fb53aa82ac34a1fadac2a0a97840a15", + "748b3a0616b544969b6ddaf8cd70092d", + "945498057b214ee1a16de0391ebc87ab", + "16343bcab2124dd28cfc5fccfb89f5a6", + "2fdc2f2670824547a27ad9e298e10534", + "c47bc520c9684382a2c0e484a0f887ff", + "eafea0fa0b864434a723ec6391ef2876", + "13e35558d8f0416fb99629ef3484bbab", + "899f920fd1ea4ee2b230f89a337304ec", + "eec3edc793a84abeb1af22221b24b3c8", + "a7dfe44a19de4ffca626c2d610759297", + "c96f3210638442c195f0511b140793ba", + "c09bd90e783f4f5c88c2d57191fbdc19", + "531daeb371ad4b51b1d493053d8b1e52", + "a1951348ae654b4583611c5e0425d81b", + "34a5f08bf5f849c4829f3b83fd27bbec", + "4a7301446bba44b798a70a27260e0d56", + "ccd15d0d08c743b08ac1419d25199a42", + 
"11b6ad94ce8d419fb55a3817997037af", + "4617899577864785a7a85be21c36fda4", + "7ec8f069212f455589812568d35f7cde", + "b1266aa9efbd400b877df36a942f4337", + "b4daac723071481da2f3d7a5e1517b5f", + "c641704c716a40fcb3d54720cb4ff998", + "4a39f8caa9214894a2a9e32cf0087105", + "5e6fea490d71483b9fabf0d5f84f1934", + "e02a23c6f06f42639b82c65ce72246c2", + "71449aa1769643c8b57094d35bc8b35f", + "1303f70db20f45a39195b9872c8482ab", + "b01ba70e26da4b83b096c1b23da688d4", + "35b2a6e0f8c543c4995e72d1543c39aa" + ] + }, + "id": "mKVJEUhmHjA2", + "outputId": "1ebb4f40-25f5-4168-87f9-f9381cd92fb6" + }, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "e4f9cff7519b401a9db04f247b7e473c", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Uploading the dataset shards: 0%| | 0/1 [00:00" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_image(\"/content/images/autotrain_screen1.png\")\n", + "display_image(\"/content/images/autotrain_screen2.png\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N3OZQbGz3Cq3" + }, + "source": [ + "### 5.1 Using AutoTrain UI" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s9j5hzLdgH4K" + }, + "source": [ + "After space creation, AutoTrain UI will allow us to select from range of tasks. We will have to configure our trainer on the AutoTrain UI.\n", + "1. We will select Token classification as our task.\n", + "2. For our tutorial we will fine-tune `google-bert/bert-base-uncased`. We can choose any model from the list.\n", + "3. For DataSource select `Hugging Face Hub` which will give us a text box to fill in the dataset which we want to use for fine-tuning. We will use the dataset we pushed to Huggingface hub. 
I will be using the dataset that I pushed to huggingface hub `bikashpatra/claims_annotated_hf`\n", + "4. Enter the keys for `train` and `validation` split.\n", + "5. Under Column Mapping , enter the columns which store the tokens and tags. In my dataset , tokens are stored in `tokens` column and labels are stored in `ner_tags` column.\n", + "With the above 5 inputs, we can trigger `Start Training` and AutoTrain will take care of fine-tuning the base model on our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 625 + }, + "id": "-d22KO0Wtb7C", + "outputId": "c91c5fd2-eb3a-4ce7-b5d4-842944e6a919" + }, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "display_image(\"/content/images/autotrain_ui.png\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wi2EnUGcZ_4x" + }, + "source": [ + "### 5.2 Using AutoTrain CLI" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "ZpfIZ7GrMskA" + }, + "outputs": [], + "source": [ + "# for this cell to work, you will have to store HF_TOKEN as secret in colab notebook.\n", + "os.environ['TOKEN'] = userdata.get('HF_TOKEN')" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "knmZ3N4bde7Q", + "outputId": "0562a30b-821c-4a5b-fc0c-bfa05d5c5ded" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[1mINFO \u001b[0m | \u001b[32m2024-08-20 06:44:18\u001b[0m | \u001b[36mautotrain.cli.run_token_classification\u001b[0m:\u001b[36mrun\u001b[0m:\u001b[36m179\u001b[0m - \u001b[1mRunning Token Classification\u001b[0m\n", + "\u001b[33m\u001b[1mWARNING \u001b[0m | \u001b[32m2024-08-20 06:44:18\u001b[0m | 
\u001b[36mautotrain.trainers.common\u001b[0m:\u001b[36m__init__\u001b[0m:\u001b[36m180\u001b[0m - \u001b[33m\u001b[1mParameters supplied but not used: version, inference, config, func, train, deploy, backend\u001b[0m\n", + "\u001b[1mINFO \u001b[0m | \u001b[32m2024-08-20 06:44:22\u001b[0m | \u001b[36mautotrain.cli.run_token_classification\u001b[0m:\u001b[36mrun\u001b[0m:\u001b[36m185\u001b[0m - \u001b[1mJob ID: bikashpatra/autotrain-claims-token-classification\u001b[0m\n" + ] + } + ], + "source": [ + "!autotrain token-classification --train \\\n", + " --username \"bikashpatra\" \\\n", + " --token $TOKEN \\\n", + " --backend \"spaces-a10g-small\" \\\n", + " --project-name \"claims-token-classification\" \\\n", + " --data-path \"bikashpatra/sample_claims_annotated_hf\" \\\n", + " --train-split \"train\" \\\n", + " --valid-split \"validation\" \\\n", + " --tokens-column \"tokens\" \\\n", + " --tags-column \"ner_tags\" \\\n", + " --model \"distilbert-base-uncased\" \\\n", + " --lr \"2e-5\" \\\n", + " --log \"tensorboard\" \\\n", + " --epochs \"10\" \\\n", + " --weight-decay \"0.01\" \\\n", + " --warmup-ratio \"0.1\" \\\n", + " --max-seq-length \"256\" \\\n", + " --mixed-precision \"fp16\" \\\n", + " --push-to-hub" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_d5psaOcd9RQ" + }, + "source": [ + "AutoTrain automatically creates a Hugging Face Space for us and triggers the training job. The Space is available at `https://huggingface.co/spaces/$JOBID`, where `JOBID` is the Job ID printed in the logs of the `autotrain` CLI command (`bikashpatra/autotrain-claims-token-classification` in the run above).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K33a7lmFfFcp" + }, + "source": [ + "If training completes without errors, the model is available under the name we passed to `--project-name`. In the example above, that is `claims-token-classification`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E0z1Z237gUgb" + }, + "source": [ + "## 6. 
Inference\n", + "With the hard work done, we now have a model trained on our custom dataset. We can use it to predict labels for un-annotated rows.\n", + "We will use the [HF Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) API. Pipelines are an easy-to-use abstraction for loading a model and running inference on unseen data. In the context of this tutorial, _inference on unseen text_ means predicting labels for tokens in un-annotated text." + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "GcAgUShwgBQG" + }, + "outputs": [], + "source": [ + "# Classify a sample text\n", + "claims_text = \"\"\"\n", + "The FINFET of claim 11 , wherein the conformal gate dielectric comprises a high-κ gate dielectric selected from\n", + "the group consisting of: hafnium oxide (HfO 2 ), lanthanum oxide (La 2 O 3 ), and combinations thereof.\n", + "\"\"\"\n", + "classifier = pipeline(\"token-classification\", model=\"bikashpatra/claims-token-classification\",device=\"cpu\")\n", + "preds = classifier(claims_text)" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "uhKfsHJIt4GW", + "outputId": "9515118f-5bbd-4dac-cad4-2116a553a653" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "{0: 'B-Chemical Compound',\n", + " 1: 'I-Biotechnology',\n", + " 2: 'B-Electrical Circuit',\n", + " 3: 'B-Process',\n", + " 4: 'B-Biotechnology',\n", + " 5: 'O',\n", + " 6: 'I-Chemical Compound',\n", + " 7: 'I-Process',\n", + " 8: 'B-Algorithm',\n", + " 9: 'I-Algorithm'}" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# The labels used for fine-tuning the model.\n", + "classifier.model.config.id2label" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QeuBwT8BobXm" + }, + "source": [ + "## 7. 
Push predictions to Argilla Dataset\n", + "Using the [`rg.Query`](https://docs.argilla.io/latest/how_to_guides/query/) API, we fetch the un-annotated records and predict labels for their tokens.\n", + "\n", + "`rg.Filter((\"response.status\",\"==\",\"pending\"))` creates an Argilla filter; passing it to [`rg.Query`](https://docs.argilla.io/latest/how_to_guides/query/) returns all the records in the Argilla dataset that have not been annotated yet." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "id": "_6DFhfFJQx5X" + }, + "outputs": [], + "source": [ + "# Create a filter query to get only `pending` records in argilla dataset.\n", + "status_filter = rg.Query(filter=rg.Filter((\"response.status\", \"==\", \"pending\")))\n", + "\n", + "pending = rg_dataset.records(status_filter).to_list(flatten=True)\n", + "claims = random.sample(pending,k=10) # pick 10 random samples.\n", + "\n", + "spans = classifier(claims[0]['tokens'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zvTdHkk_sr3r" + }, + "source": [ + "### 7.1 Helper function to predict the spans" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "id": "QFbQcRmEe7r9" + }, + "outputs": [], + "source": [ + "def predict_spanmarker(pipe:TokenClassificationPipeline,text: str):\n", + " \"\"\"\n", + " Predict span markers for the given text using the provided pipeline.\n", + " Args:\n", + " pipe (TokenClassificationPipeline): A pipeline object for token classification.\n", + " text (str): The input text for which span markers are to be predicted.\n", + " Returns:\n", + " List[Dict[str, Union[int, str]]]: A list of dictionaries containing the predicted span markers.\n", + " Each dictionary has 'start', 'end', and 'label' keys.\n", + " \"\"\"\n", + " markers = pipe(text)\n", + " spans = [\n", + " {\"label\": marker[\"entity\"][2:], \"start\": marker[\"start\"], \"end\": marker[\"end\"]}\n", + " for marker in markers if marker[\"entity\"] != 
\"O\"\n", + " ]\n", + " return spans" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": { + "id": "BTn-_U3GgOxg" + }, + "outputs": [], + "source": [ + "updated_data=[\n", + " {\n", + " \"span_label\": predict_spanmarker(pipe=classifier, text=sample['tokens']),\n", + " \"id\": sample[\"id\"],\n", + " }\n", + " for sample in claims\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7OJI3t-WlWgX", + "outputId": "5c90b86c-3bf6-4574-f577-d53f9659c870" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "[{'label': 'Chemical Compound', 'start': 0, 'end': 3},\n", + " {'label': 'Process', 'start': 4, 'end': 10}]" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# print a few predictions\n", + "updated_data[0]['span_label'][:2]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dt9sZUMAszvq" + }, + "source": [ + "### 7.2 Insert records to Argilla Dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 137 + }, + "id": "G3qkBROggKQ6", + "outputId": "160fefe6-acde-498d-9b4e-edea293b5942" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
DatasetRecords: The provided batch size 256 was normalized. Using value 10.\n",
+       "
\n" + ], + "text/plain": [ + "DatasetRecords: The provided batch size \u001b[1;36m256\u001b[0m was normalized. Using value \u001b[1;36m10\u001b[0m.\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Sending records...: 100%|██████████| 1/1 [00:00<00:00, 1.15batch/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "DatasetRecords(Dataset(id=UUID('a187cdad-175e-4d87-989f-a529b9999bde') inserted_at=datetime.datetime(2024, 7, 28, 7, 23, 59, 902685) updated_at=datetime.datetime(2024, 7, 28, 7, 35, 55, 80617) name='claim_tokens' status='ready' guidelines='Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.' allow_extra_metadata=False distribution=None workspace_id=UUID('fd4fc24c-fc1f-4ffe-af41-d569432d6b50') last_activity_at=datetime.datetime(2024, 7, 28, 7, 35, 55, 80181)))" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "rg_dataset.records.log(records=updated_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3fuExqyYK2yx" + }, + "source": [ + "The records we update here are stored as [`suggestions`](https://docs.argilla.io/latest/reference/argilla/records/suggestions/), not [`responses`](https://docs.argilla.io/latest/reference/argilla/records/responses/). In the context of this tutorial, a response is created when an annotator saves an annotation, while suggestions are labels predicted by the model. Therefore, the records we updated here still have `response.status` set to `pending` rather than `submitted`. This lets annotators review the predicted labels and accept or reject the model predictions.\n", + "\n", + "If we want to accept the model-predicted annotations for the tokens in a text, we can save the `suggestions` as `responses`; otherwise, we will have to add, remove, or edit the labels applied to the tokens." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r96xrhjeASAG" + }, + "source": [ + "## 8. Conclusion\n", + "\n", + "In this comprehensive tutorial, we've explored a complete workflow for data annotation and model fine-tuning. We began by setting up an [Argilla](https://argilla.io/) instance on [Hugging Face Spaces](https://huggingface.co/spaces), providing a robust platform for data management. We then configured and created a dataset within our Argilla instance, leveraging its user-friendly interface to manually annotate a subset of records.\n", + "\n", + "We continued as we exported the high-quality annotated data to a Hugging Face [dataset](https://huggingface.co/datasets), bridging the gap between annotation and model training. We then demonstrated the power of transfer learning by fine-tuning a `distilbert-base-uncased` model on this curated dataset using Hugging Face's [AutoTrain](https://huggingface.co/autotrain), a tool that simplifies the complexities of model training.\n", + "\n", + "The workflow came full circle as we applied our fine-tuned model to annotate the remaining unlabeled records in the Argilla dataset, showcasing how machine learning can accelerate the annotation process. This tutorial should provide a solid foundation for implementing an iterative annotation and fine-tuning pipeline while illustrating the synergy between human expertise and machine learning capabilities.\n", + "\n", + "This iterative approach allows for continuous improvement, making it an invaluable tool for tackling a wide range of natural language processing tasks efficiently and effectively." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "99YxP5sKGjxN" + }, + "source": [ + "## 9. 
Acknowledgements\n", + "\n", + "I would like to express my sincere gratitude to the following individuals who have contributed to this notebook:\n", + "\n", + "- **[David Berenstein](https://x.com/davidberenstei)** for his invaluable insights and guidance.\n", + "- **[Sara Han](https://x.com/sdiazlor)** for answering my frequent queries on discord.\n", + "\n", + "This work would not have been possible without their support and expertise.\n", + "\n", + "Additionally, a nicer version of this notebook can be seen by replacing **github** in `https://github.com/bikash119/argilla/blob/argilla_with_autotrain/argilla/docs/community/token_classification_tutorial.ipynb` with **nbsanity**. Thanks to **[Hamel Hussain](https://x.com/HamelHusain)** for creating the notebook rendering utility." + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [ + "VLwsHGv1H7qn", + "Gv3ctU0RcgcZ", + "txIoyMIBcgWq", + "aWU2pWV1cgT5", + "ZlM-G2oWcgOQ", + "gNCS-BwJx_XM", + "vjLZrBL62myl", + "E0z1Z237gUgb", + "QeuBwT8BobXm", + "r96xrhjeASAG" + ], + "gpuType": "T4", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/argilla/docs/community/token_classification_tutorial.md b/argilla/docs/community/token_classification_tutorial.md new file mode 100644 index 0000000000..dc8a9c6b1c --- /dev/null +++ b/argilla/docs/community/token_classification_tutorial.md @@ -0,0 +1,914 @@ +Open In Colab + +# Fine-tuning a token classification model using custom Argilla Dataset and HuggingFace AutoTrain + +We all would want to try out to solve some use case with a neat 
tool / techs available out there. +In this tutorial, I want to walk through my learning journey of fine-tuning a model on US patent text. + +## 1. Introduction + + + +### 1.1 Background on Named Entity Recognition (NER) + +Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc. + +### 1.2 Importance of NER in Natural Language Processing + +NER plays a crucial role in various NLP applications, including: +- Information Retrieval +- Question Answering Systems +- Machine Translation +- Text Summarization +- Sentiment Analysis + +### 1.3 Challenges in NER for Specific Domains or Languages + +While general-purpose NER models exist, they often fall short when applied to specialized domains or less-common languages due to: +- Domain-specific terminology +- Unique entity types +- Language-specific nuances + +### 1.4 The Need for Custom, Fine-tuned Models + +To address these challenges, fine-tuning custom NER models becomes essential. This approach allows for: +- Adaptation to specific domains: A fine-tuned model can perform better on specific tasks or domains than a general-purpose model. +- Efficiency: Fine-tuned models often require less data and fewer computational resources than training from scratch to achieve good performance on specific tasks. +- Faster inference: Smaller, task-specific models typically run faster than larger, general-purpose ones. + +### 1.5 Project Objectives and Overview + +In this project, we aim to fine-tune a custom NER model for USPTO Patents. Our objectives include: +- Use [Hugging Face Spaces](https://huggingface.co/spaces) to set up an instance of [Argilla](https://argilla.io/). +- Use [Argilla](https://argilla.io/) UI to annotate our dataset with custom labels. 
+- Use Hugging Face [AutoTrain](https://huggingface.co/autotrain) to create a more efficient model in terms of size and inference speed. +- Demonstrating the effectiveness of transfer learning in NER tasks. + + +## 2. Data Background + + + + + +US Patent texts are typically long, descriptive documents about inventions. The data used in this tutorial can be accessed through the [Kaggle USPTO Competition](https://www.kaggle.com/competitions/uspto-explainable-ai). Each patent contains several fields: +- Title +- Abstract +- Claims +- Description + +For this tutorial, we'll focus on the `claims` field. + +### 2.1 Problem Statement + + + Our goal is to fine-tune a model to classify tokens in the `claims` field of a given patent. + +### 2.2 Breaking Down the Problem + + +To achieve this goal, we need: + +1. High-quality data to fine-tune a pretrained token classification model +2. Infrastructure to execute the training + +## 3. Create High-Quality Data with Argilla + [Argilla](https://github.com/argilla-io/argilla/) is an excellent tool for creating high-quality datasets with a user-friendly interface for labeling. + + + +### 3.1 Setting Up Argilla on Hugging Face Spaces + +#### 1. Visit [Hugging Face Spaces deployment page](https://huggingface.co/new-space?template=argilla/argilla-template-space) + +#### 2. Create a new space: + - Provide a name + - Select `Docker` as Space SDK + - Choose `Argilla` as Docker Template + - Leave other fields empty for simplicity + - Click on `Create Space` + +#### 3. Restart the Space + +Now you have an Argilla instance running on Hugging Face Spaces. Click on the space you created to go to the login screen of Argilla UI. 
+Access the UI using the credentials: + +- Username: `admin` +- Password: `12345678` [default password] + +For more options and setting up the Argilla instance for production use-cases, please refer to [Configure Argilla on Huggingface](https://docs.argilla.io/dev/getting_started/how-to-configure-argilla-on-huggingface/) + +### 3.2 Create a Dataset with Argilla Python SDK + +#### Step 1: Install & Import packages + + +```python +!pip install -U datasets argilla autotrain-advanced==0.8.8 > install_logs.txt 2>&1 +``` + + +```python +import argilla as rg +import pandas as pd +import re +import os +import random +import torch +from IPython.display import Image, display,HTML +from datasets import load_dataset, Dataset, DatasetDict,ClassLabel,Sequence,Value,Features +from transformers import pipeline,TokenClassificationPipeline +from typing import List, Dict, Union,Tuple +from google.colab import userdata +``` + +#### Step 2: Initialize the Argilla Client +api_url: We can get this URL by using the `https://huggingface.co/spaces//?embed=true` + + + +```python +client = rg.Argilla( + api_url="https://bikashpatra-argilla-uspto-labelling.hf.space", + #api_url="https://-.hf.space # This is url to my public space. + api_key="admin.apikey", # default value. Shouldn't be used for production. + # headers={"Authorization": f"Bearer {HF_TOKEN}"} +) +#Replace `` and `` with your actual Hugging Face username and space name. +``` + +#### Step 3: Configure the Dataset +To configure an Argilla dataset for token classification task, we will have to + +1. Come up with labels specific to our problem domain: I came up with some labels by using the following prompt + >suggest me some labels like "Process", "Product", "Composition of Matter" which can be used to annotate tokens in the claims or description section of patents filed in US + +2. We need to configure fields/columns of our dataset and [`questions`](https://docs.argilla.io/latest/how_to_guides/dataset/#questions). 
The `questions` parameter lets you instruct and guide the annotator on the task. In our use case, we use the `labels` we created as the options the annotators select from when annotating pieces (tokens) of text. + + + +```python +# Labels for token classification +labels = [ + "Process", "Product", "Composition of Matter", "Method of Use", + "Software", "Hardware", "Algorithm", "System", "Device", + "Apparatus", "Method", "Machine", "Manufacture", "Design", + "Pharmaceutical Formulation", "Biotechnology", "Chemical Compound", + "Electrical Circuit" +] + +# Dataset settings +settings = rg.Settings( + guidelines="Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.", + fields=[ + rg.TextField(name="tokens", title="Text", use_markdown=True), + rg.TextField(name="document_id", title="publication_number", use_markdown=True), + rg.TextField(name="sentence_id", title="sentence_id", use_markdown=False) + ], + questions=[ + rg.SpanQuestion( + name="span_label", + field="tokens", + labels=labels, + title="Classify the tokens according to the specified categories.", + allow_overlapping=True + ) + ] +) +``` + +#### Step 4: Create dataset on Argilla instance +With the settings in place, we are ready to create our dataset using the [`rg.Dataset`](https://docs.argilla.io/latest/how_to_guides/dataset/#create-a-dataset) API. + + +```python +# We name the dataset as claim_tokens +rg_dataset = rg.Dataset( + name="claim_tokens", + settings=settings, +) +rg_dataset.create() +``` + + /usr/local/lib/python3.10/dist-packages/argilla/datasets/_resource.py:202: UserWarning: Workspace not provided. Using default workspace: admin id: fd4fc24c-fc1f-4ffe-af41-d569432d6b50 + warnings.warn(f"Workspace not provided. 
Using default workspace: {workspace.name} id: {workspace.id}") + + + + + + Dataset(id=UUID('a187cdad-175e-4d87-989f-a529b9999bde') inserted_at=datetime.datetime(2024, 7, 28, 7, 23, 59, 902685) updated_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) name='claim_tokens' status='ready' guidelines='Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.' allow_extra_metadata=False workspace_id=UUID('fd4fc24c-fc1f-4ffe-af41-d569432d6b50') last_activity_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) url=None) + + + + After step 4 we should see the dataset created in Argilla UI. We can verify this by logging in to the Argilla UI `url https://huggingface.co/spaces/-.hf.space)` with the default credentials. + + +We can look into the settings of the dataset by clicking on the settings icon next to the dataset name. + + + +```python +def display_image(filename): display(Image(filename=filename)) + +display_image('/content/images/argilla_ds_list_settings.png') +``` + + + +![png](token_classification_tutorial_files/token_classification_tutorial_25_0.png) + + + + The Fields tab of settings screen lists down fields we configured while creating the dataset using Python SDK. + + +```python +display_image('/content/images/argilla_ds_settings.png') +``` + + + +![png](token_classification_tutorial_files/token_classification_tutorial_27_0.png) + + + +#### Step 5: Insert records to the Argilla datasets + +Data preparation notebook can be found [here](https://www.kaggle.com/code/boredmgr/claim-sampling) + + +```python +claims = pd.read_csv("/content/sample_publications.csv") +claims.head(2) +``` + + + + + + + + + + + + + + + + + + + + + + + + + +
| | publication_number | sequence_id | tokens |
|---|---|---|---|
| 0 | US-4444749-A | 0 | A shampoo comprising an aqueous solution of an... |
| 1 | US-4444749-A | 1 | A shampoo comprising an aqueous solution of an... |
+ + +Here we read the rows of the CSV and map them to the fields we configured for the Argilla dataset. + + +```python +## The CSV has three columns: publication_number, sequence_id, tokens + +publication_df = pd.read_csv("/content/sample_publications.csv") +## Convert dataframe rows to Argilla records; values are cast to str because all three fields are TextFields +records = [ + rg.Record( + fields= + {"tokens": str(row["tokens"]) + ,'document_id':str(row['publication_number']) + ,'sentence_id':str(row['sequence_id']) + }) + for _,row in publication_df.iterrows() + ] +## Store the records in the Argilla dataset +rg_dataset.records.log(records) +``` + + +
DatasetRecords: The provided batch size 256 was normalized. Using value 149.
+
+ + + + Sending records...: 100%|██████████| 1/1 [00:00<00:00, 1.71batch/s] + + + + + + DatasetRecords(Dataset(id=UUID('a187cdad-175e-4d87-989f-a529b9999bde') inserted_at=datetime.datetime(2024, 7, 28, 7, 23, 59, 902685) updated_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) name='claim_tokens' status='ready' guidelines='Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.' allow_extra_metadata=False workspace_id=UUID('fd4fc24c-fc1f-4ffe-af41-d569432d6b50') last_activity_at=datetime.datetime(2024, 7, 28, 7, 24, 1, 901701) url=None)) + + + +Once the records are pushed to the Argilla dataset, the UI renders the records together with the labels so that the annotator can annotate the text. + +Check the screenshot below. + + +```python +display_image("/content/images/annotation_screen.png") +``` + + + +![png](token_classification_tutorial_files/token_classification_tutorial_34_0.png) + + + +#### Step 6: Annotate tokens in every record with appropriate labels +Log in to the Argilla UI and start annotating. + +Argilla UI : `https://huggingface.co/spaces//` + +username : `admin` + +password : `12345678` + + + +> After annotating the data, we have to convert the Argilla dataset to a Hugging Face dataset in order to use Hugging Face AutoTrain for fine-tuning the model. AutoTrain can also train on CSV data uploaded from the AutoTrain UI, but for this tutorial we will use a Hugging Face dataset. + + + +## 4. Argilla Dataset to HuggingFace Dataset + +#### Step 1: Load our annotated dataset + + +```python +rg_dataset = client.datasets("claim_tokens") +``` + +#### Step 2: Filter the records which are annotated +To iterate quickly on annotation and training, we should be able to annotate a few records and train our model on just those. We can achieve this by using the [query/filter](https://docs.argilla.io/latest/how_to_guides/query/) functionality of Argilla Dataset. 
+ +Using [`rg.Query()`](https://docs.argilla.io/latest/how_to_guides/query/) api we can filter the records which are annotated for preparing our training dataset. + + +```python +status_filter = rg.Query(filter=rg.Filter(("response.status", "==", "submitted"))) + +submitted = rg_dataset.records(status_filter).to_list(flatten=True) +submitted[0] +``` + + + + + {'id': '01e9b4bb-9c98-4cec-acea-dd686cddf5f0', + 'status': 'pending', + '_server_id': '0b6f16f3-c3dc-4947-ac77-8b65002bf350', + 'tokens': 'The FINFET of claim 11 , wherein the conformal gate dielectric comprises a high-κ gate dielectric selected from the group consisting of: hafnium oxide (HfO 2 ), lanthanum oxide (La 2 O 3 ), and combinations thereof.', + 'document_id': 'US-11631617-B2', + 'sentence_id': '14', + 'span_label.responses': [[{'label': 'Electrical Circuit', + 'start': 4, + 'end': 10}, + {'label': 'Chemical Compound', 'start': 138, 'end': 151}, + {'label': 'Chemical Compound', 'start': 162, 'end': 177}]], + 'span_label.responses.users': ['4e9588d6-e2d6-450d-82c6-b33324d94708'], + 'span_label.responses.status': ['submitted']} + + + +The annotated dataset cannot be fed as is to the model for fine-tuning. For token-classification task, we will have to make our data that adheres to the structure as described below. +- Dataset Structure: The dataset should typically have two main columns: + - `tokens`: A list of words/tokens for each example. + - `ner_tags`: A list of corresponding labels for each token. The labels must follow the [IOB labelling scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). +- Label Encoding: The labels should be integers, with each integer corresponding to a specific named entity tag. +Below functions will allow us to convert our Argilla dataset to the required dataset structure. 
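Before those helpers, the target structure can be illustrated on a toy sentence. This is a minimal sketch, not the tutorial's implementation: the helper name `spans_to_iob`, the toy text and spans, and the choice to sort the label names are all assumptions made for illustration. It uses whitespace tokenization and the same "token lies fully inside the span" rule as the functions that follow.

```python
import re
from typing import Dict, List, Tuple, Union

Span = Dict[str, Union[int, str]]


def spans_to_iob(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Tokenize on whitespace and tag each token with B-/I-/O from character spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            # A token belongs to a span when it lies fully inside it.
            if match.start() >= span["start"] and match.end() <= span["end"]:
                prefix = "B" if match.start() == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical toy record with two annotated character spans.
text = "hafnium oxide on the FINFET"
spans = [
    {"label": "Chemical Compound", "start": 0, "end": 13},
    {"label": "Electrical Circuit", "start": 21, "end": 27},
]
tokens, tags = spans_to_iob(text, spans)

# Sorting gives a stable label order (an assumption of this sketch).
label_names = sorted(set(tags))
label2id = {name: i for i, name in enumerate(label_names)}
ner_tags = [label2id[tag] for tag in tags]

print(tokens)
print(tags)
print(ner_tags)
```

The `tokens` and `ner_tags` lists have the same length, and each integer in `ner_tags` indexes into `label_names`, which is the shape a token classification dataset expects.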
+ + + +```python +def get_iob_tag_for_token(token_start:int, token_end:int, ner_spans:List[Dict[str, Union[int, str]]]) -> str: + """ + Determine the IOB tag for a given token based on its position within NER spans. + + Args: + token_start (int): The start index of the token in the text. + token_end (int): The end index of the token in the text. + ner_spans (List[Dict[str, Union[int, str]]]): A list of dictionaries containing NER span information. + Each dictionary should have 'start', 'end', and 'label' keys. + + Returns: + str: The IOB tag for the token. 'B-' prefix for the beginning of an entity, + 'I-' for inside an entity, or 'O' for outside any entity. + """ + for span in ner_spans: + if token_start >= span["start"] and token_end <= span["end"]: + if token_start == span["start"]: + return f"B-{span['label']}" + else: + return f"I-{span['label']}" + return "O" + + +def extract_ner_tags(text:str, responses:List[Dict[str, Union[int, str]]]): + """ + Extract NER tags for tokens in the given text based on the provided NER responses. + + Args: + text (str): The input text to be tokenized and tagged. + responses (List[Dict[str, Union[int, str]]]): A list of dictionaries containing NER span information. + Each dictionary should have 'start', 'end', and 'label' keys. + + Returns: + List[str]: A list of IOB tags corresponding to each non-whitespace token in the text. + """ + tokens = re.split(r"(\s+)", text) + ner_tags = [] + current_position = 0 + for token in tokens: + if token.strip(): + token_start = current_position + token_end = current_position + len(token) + tag = get_iob_tag_for_token(token_start, token_end, responses) + ner_tags.append(tag) + current_position += len(token) + return ner_tags +``` + +#### Step 3: Get tokens and theirs respective annotations + + +```python +def get_tokens_ner_tags(annotated_rows) -> Tuple[List[List[str]], List[List[str]]]: + """ + Extract tokens and their corresponding NER tags from annotated rows. 
+ + This function processes a list of annotated rows, where each row contains + tokens and span labels. It splits the tokens and extracts NER tags for each token. + + Args: + annotated_rows (List[Dict[str, Union[str, List[Dict[str, Union[int, str]]]]]]): + A list of dictionaries, where each dictionary represents an annotated row. + Each row should have a 'tokens' key (str) and a 'span_label.responses' key + (a list with one List[Dict[str, Union[int, str]]] per response; only the first response is used). + + Returns: + Tuple[List[List[str]], List[List[str]]]: A tuple containing two elements: + 1. A list of token lists, where each inner list represents tokens for a row. + 2. A list of NER tag lists, where each inner list represents NER tags for a row. + """ + tokens = [] + ner_tags = [] + for row in annotated_rows: + # Use the spans of the first response for this record + tags = extract_ner_tags(row["tokens"], row["span_label.responses"][0]) + tks = row["tokens"].split() + tokens.append(tks) + ner_tags.append(tags) + return tokens, ner_tags +# One record each for training and validation, to keep the first iteration small +train_tokens, train_ner_tags = get_tokens_ner_tags(submitted[:1]) +validation_tokens, validation_ner_tags = get_tokens_ner_tags(submitted[1:2]) + +``` + +##### Vibe Check +It's always good to check our data after a few operations. This helps us understand and debug whether every step produces the desired output. + + +```python +display(HTML(''' + +''')) +display(HTML(f"Sample Train Tokens: {train_tokens[0]}")) +display(HTML(f"Sample Valid Tokens: {validation_tokens[0]}")) +display(HTML(f"Sample Train tags: {train_ner_tags[0]}")) +display(HTML(f"Sample Valid tags: {validation_ner_tags[0]}")) +``` + + + + + + + + +
Sample Train Tokens:['The', 'FINFET', 'of', 'claim', '11', ',', 'wherein', 'the', 'conformal', 'gate', 'dielectric', 'comprises', 'a', 'high-κ', 'gate', 'dielectric', 'selected', 'from', 'the', 'group', 'consisting', 'of:', 'hafnium', 'oxide', '(HfO', '2', '),', 'lanthanum', 'oxide', '(La', '2', 'O', '3', '),', 'and', 'combinations', 'thereof.']

+ + + +
Sample Valid Tokens:['The', 'method', 'of', 'claim', '2', ',', 'wherein', 'generating', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions', 'based', 'at', 'least', 'in', 'part', 'on', 'the', 'set', 'of', 'attack', 'mitigation', 'rules', 'comprises', 'generating', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions', 'by', 'inputting', 'the', 'set', 'of', 'attack', 'mitigation', 'rules', 'to', 'a', 'model', 'configured', 'to', 'perform', 'structural', 'and', 'functional', 'analysis', 'to', 'interpret', 'the', 'set', 'of', 'attack', 'mitigation', 'rules,', 'wherein', 'the', 'set', 'of', 'attack', 'mitigation', 'rules', 'comprises', 'one', 'or', 'more', 'rules', 'used', 'by', 'the', 'model', 'to', 'identify', 'the', 'key-gate', 'type', 'for', 'each', 'possible', 'design', 'modification', 'solution', 'of', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions', 'and', 'one', 'or', 'more', 'rules', 'used', 'by', 'the', 'model', 'to', 'identify', 'the', 'location', 'where', 'to', 'insert', 'the', 'key-gate', 'type', 'for', 'each', 'possible', 'design', 'modification', 'solution', 'of', 'the', 'one', 'or', 'more', 'possible', 'design', 'modification', 'solutions.']

+ + + +
Sample Train tags:['O', 'B-Electrical Circuit', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Chemical Compound', 'I-Chemical Compound', 'O', 'O', 'O', 'B-Chemical Compound', 'I-Chemical Compound', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

+ + + +
Sample Valid tags:['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Algorithm', 'I-Algorithm', 'I-Algorithm', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Biotechnology', 'I-Biotechnology', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Process', 'I-Process', 'O', 'O']
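The visual check above can be complemented with an automated one. The following is a small sketch, assuming the list-of-lists shape of tokens and string tags shown in the output above; the function name `check_alignment` and the toy inputs are hypothetical and not part of the tutorial's code. It reports rows where the number of tokens and tags differ, and `I-` tags that do not continue an entity with the same label.

```python
from typing import List


def check_alignment(tokens: List[List[str]], ner_tags: List[List[str]]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks consistent."""
    problems = []
    for row, (toks, tags) in enumerate(zip(tokens, ner_tags)):
        if len(toks) != len(tags):
            problems.append(f"row {row}: {len(toks)} tokens but {len(tags)} tags")
        previous = "O"
        for pos, tag in enumerate(tags):
            # An I- tag is only valid right after a B- or I- tag with the same label.
            if tag.startswith("I-") and previous[2:] != tag[2:]:
                problems.append(f"row {row}, position {pos}: '{tag}' does not continue an entity")
            previous = tag
    return problems


# Hypothetical toy inputs: one consistent row and one broken row.
good = check_alignment(
    [["hafnium", "oxide", "layer"]],
    [["B-Chemical Compound", "I-Chemical Compound", "O"]],
)
bad = check_alignment(
    [["hafnium", "oxide"]],
    [["O", "I-Chemical Compound", "O"]],
)
print(good)
print(bad)
```

Running such a check on `train_tokens` / `train_ner_tags` and on the validation pair before building the dataset would catch misaligned rows early, when they are still easy to trace back to a record.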
+ + +As we are trying to have our data creation and model training pipeline working, for simplicity , I have dealing with one record each for training and validation. + +#### Step 4: Map labels (tags) to integers + + +```python +def mapped_ner_tags(ner_tags: List[List[str]]) -> List[List[int]]: + """ + Convert a list of NER tags to their corresponding integer IDs. + This function takes a list of lists containing string NER tags, creates a unique mapping + of these tags to integer IDs, and then converts all tags to their respective IDs. + Args: + ner_tags (List[List[str]]): A list of lists, where each inner list contains string NER tags. + Returns: + List[List[int]]: A list of lists, where each inner list contains integer IDs + corresponding to the input NER tags. + Example: + >>> ner_tags = [['O', 'B-PER', 'I-PER'], ['O', 'B-ORG']] + >>> mapped_ner_tags(ner_tags) + [[0, 1, 2], [0, 3]] + Note: + The mapping of tags to IDs is created based on the unique tags present in the input. + The order of ID assignment may vary between function calls if the input changes. + """ + labels = list(set([item for sublist in ner_tags for item in sublist])) + id2label = {i: label for i, label in enumerate(labels)} + label2id = {label: id_ for id_, label in id2label.items()} + mapped_ner_tags = [[label2id[label] for label in ner_tag] for ner_tag in ner_tags] + return mapped_ner_tags +``` + + +```python +def get_labels(ner_tags: List[List[str]]) -> List[str]: + """ + Extract unique labels from a list of NER tag sequences. + This function takes a list of lists containing NER tags and returns a list of unique labels + found across all sequences. + + Args: + ner_tags (List[List[str]]): A list of lists, where each inner list contains string NER tags. + Returns: + List[str]: A list of unique NER labels found in the input sequences. 
+ Example: + >>> ner_tags = [['O', 'B-PER', 'I-PER'], ['O', 'B-ORG', 'I-ORG'], ['O', 'B-PER']] + >>> get_labels(ner_tags) + ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG'] + Note: + The order of labels in the output list is not guaranteed to be consistent + between function calls, as it depends on the order of iteration over the set. + """ + return list(set([item for sublist in ner_tags for item in sublist])) +``` + +#### Step 5: Argilla Dataset to HuggingFace Dataset +We now have our data in the structure required for a token classification dataset. All that remains is to create a Hugging Face Dataset. + + +```python +train_labels = get_labels(train_ner_tags) +validation_labels = get_labels(validation_ner_tags) +# Sort the combined label set so the label order is stable between runs +labels = sorted(set(train_labels + validation_labels)) +# Build one label-to-id mapping from the combined list so the integer ids +# match the ClassLabel names in both splits +label2id = {label: idx for idx, label in enumerate(labels)} +features = Features({ + "tokens": Sequence(Value("string")), + "ner_tags": Sequence(ClassLabel(num_classes=len(labels), names=labels)) +}) +train_records = [ + { + "tokens": token, + "ner_tags": [label2id[tag] for tag in ner_tag], + } + for token, ner_tag in zip(train_tokens, train_ner_tags) +] +validation_records = [ + { + "tokens": token, + "ner_tags": [label2id[tag] for tag in ner_tag], + } + for token, ner_tag in zip(validation_tokens, validation_ner_tags) +] +span_dataset = DatasetDict( + { + "train": Dataset.from_list(train_records,features=features), + "validation": Dataset.from_list(validation_records,features=features), + } +) + +``` + + +```python +# Assertion to verify that the train split conforms to the dataset structure required for fine-tuning. 
+assert span_dataset['train'].features['ner_tags'].feature.names is not None +``` + +#### Step 6: Push dataset to Hugginface Hub + + +```python +!huggingface-cli login +``` + + + _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_| + _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _| + _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_| + _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _| + _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_| + + To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens . + Enter your token (input will not be visible): + Add token as git credential? (Y/n) n + Token is valid (permission: write). + Your token has been saved to /root/.cache/huggingface/token + Login successful + + + +```python +span_dataset.push_to_hub("bikashpatra/sample_claims_annotated_hf") +``` + + + Uploading the dataset shards: 0%| | 0/1 [00:00DatasetRecords: The provided batch size 256 was normalized. Using value 10. + + + + + Sending records...: 100%|██████████| 1/1 [00:00<00:00, 1.15batch/s] + + + + + + DatasetRecords(Dataset(id=UUID('a187cdad-175e-4d87-989f-a529b9999bde') inserted_at=datetime.datetime(2024, 7, 28, 7, 23, 59, 902685) updated_at=datetime.datetime(2024, 7, 28, 7, 35, 55, 80617) name='claim_tokens' status='ready' guidelines='Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.' allow_extra_metadata=False distribution=None workspace_id=UUID('fd4fc24c-fc1f-4ffe-af41-d569432d6b50') last_activity_at=datetime.datetime(2024, 7, 28, 7, 35, 55, 80181))) + + + +The records we update here are stored as [`suggestions`](https://docs.argilla.io/latest/reference/argilla/records/suggestions/) and not [`responses`](https://docs.argilla.io/latest/reference/argilla/records/responses/). 
Responses, in the context of this tutorial, are created when an annotator saves an annotation. Suggestions are labels predicted by a model. Therefore, the records we updated here will have `response.status` as `pending` and not `submitted`. This allows us (or other annotators) to review the predicted labels and accept or reject the model predictions. + +If we want to accept the model-predicted annotations for the tokens in a text, we can save the [`suggestions`] as [`responses`]; otherwise, we will have to add, remove, or edit the labels applied to the tokens. + +## 8. Conclusion + +In this tutorial, we walked through a complete workflow for data annotation and model fine-tuning. We began by setting up an [Argilla](https://argilla.io/) instance on [Hugging Face Spaces](https://huggingface.co/spaces) as the platform for managing our data. We then configured and created a dataset within our Argilla instance and used its UI to manually annotate a subset of records. + +Next, we exported the annotated data to a Hugging Face [dataset](https://huggingface.co/datasets), bridging the gap between annotation and model training. We then applied transfer learning by fine-tuning a `distilbert-base-uncased` model on this curated dataset using Hugging Face's [AutoTrain](https://huggingface.co/autotrain), a tool that simplifies the complexities of model training. + +The workflow came full circle as we applied our fine-tuned model to annotate the remaining unlabeled records in the Argilla dataset, showing how machine learning can accelerate the annotation process. This tutorial should provide a solid foundation for implementing an iterative annotation and fine-tuning pipeline that combines human expertise with machine learning capabilities. 
+ +This iterative approach allows for continuous improvement, which makes it useful for tackling a wide range of natural language processing tasks efficiently. + +## 9. Acknowledgements + +I would like to express my sincere gratitude to the following individuals who have contributed to this notebook: + +- **[David Berenstein](https://x.com/davidberenstei)** for his invaluable insights and guidance. +- **[Sara Han](https://x.com/sdiazlor)** for answering my frequent queries on Discord. + +This work would not have been possible without their support and expertise. + +Additionally, a nicer version of this notebook can be seen by replacing **github** in `https://github.com/bikash119/argilla/blob/argilla_with_autotrain/argilla/docs/community/token_classification_tutorial.ipynb` with **nbsanity**. Thanks to **[Hamel Husain](https://x.com/HamelHusain)** for creating this notebook rendering utility. + + +```python + +``` diff --git a/argilla/docs/community/token_classification_tutorial.qmd b/argilla/docs/community/token_classification_tutorial.qmd new file mode 100644 index 0000000000..04ee5980b5 --- /dev/null +++ b/argilla/docs/community/token_classification_tutorial.qmd @@ -0,0 +1,654 @@ +--- +jupyter: python3 +--- + +Open In Colab + +# Fine-tuning a token classification model using custom Argilla Dataset and HuggingFace AutoTrain + +We all want to try solving a use case with the neat tools and techniques available out there. +In this tutorial, I want to walk through my learning journey of fine-tuning a model on US patent text. + +## 1. Introduction + + +### 1.1 Background on Named Entity Recognition (NER) + +Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying named entities in text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc. 
+
+### 1.2 Importance of NER in Natural Language Processing
+
+NER plays a crucial role in various NLP applications, including:
+- Information Retrieval
+- Question Answering Systems
+- Machine Translation
+- Text Summarization
+- Sentiment Analysis
+
+### 1.3 Challenges in NER for Specific Domains or Languages
+
+While general-purpose NER models exist, they often fall short when applied to specialized domains or less-common languages due to:
+- Domain-specific terminology
+- Unique entity types
+- Language-specific nuances
+
+### 1.4 The Need for Custom, Fine-tuned Models
+
+To address these challenges, fine-tuning custom NER models becomes essential. This approach allows for:
+- Adaptation to specific domains: A fine-tuned model can perform better on specific tasks or domains than a general-purpose model.
+- Efficiency: Fine-tuned models often require less data and computational resources to achieve good performance on specific tasks.
+- Faster inference: Smaller, task-specific models run faster than larger, general-purpose ones.
+
+### 1.5 Project Objectives and Overview
+
+In this project, we aim to fine-tune a custom NER model for USPTO Patents. Our objectives include:
+- Use [Hugging Face Spaces](https://huggingface.co/spaces) to set up an instance of [Argilla](https://argilla.io/).
+- Use the [Argilla](https://argilla.io/) UI to annotate our dataset with custom labels.
+- Use Hugging Face [AutoTrain](https://huggingface.co/autotrain) to create a more efficient model in terms of size and inference speed.
+- Demonstrate the effectiveness of transfer learning in NER tasks.
+
+## Data Background
+
+
+
+
+US Patent texts are typically long, descriptive documents about inventions. The data used in this tutorial can be accessed through the [Kaggle USPTO Competition](https://www.kaggle.com/competitions/uspto-explainable-ai). Each patent contains several fields:
+- Title
+- Abstract
+- Claims
+- Description
+
+For this tutorial, we'll focus on the `claims` field.
+ +## Problem Statement + + Our goal is to fine-tune a model to classify tokens in the `claims` field of a given patent. + +## Breaking Down the Problem + +To achieve this goal, we need: + +1. High-quality data to fine-tune a pretrained token classification model +2. Infrastructure to execute the training + +## Create High-Quality Data with Argilla + [Argilla](https://github.com/argilla-io/argilla/) is an excellent tool for creating high-quality datasets with a user-friendly interface for labeling. + + +### Setting Up Argilla on Hugging Face Spaces + +#### 1. Visit [Hugging Face Spaces deployment page](https://huggingface.co/new-space?template=argilla/argilla-template-space) + +#### 2. Create a new space: + - Provide a name + - Select `Docker` as Space SDK + - Choose `Argilla` as Docker Template + - Leave other fields empty for simplicity + - Click on `Create Space` + +#### 3. Restart the Space + +Now you have an Argilla instance running on Hugging Face Spaces. Click on the space you created to go to the login screen of Argilla UI. 
+Access the UI using the credentials:
+
+- Username: `admin`
+- Password: `12345678` [default password]
+
+For more options and for setting up the Argilla instance for production use cases, please refer to [Configure Argilla on Huggingface](https://docs.argilla.io/dev/getting_started/how-to-configure-argilla-on-huggingface/).
+
+### Create a Dataset with Argilla Python SDK
+
+#### Step 1: Install & Import packages
+
+```{python}
+#| jupyter: {outputs_hidden: true}
+!pip install -U datasets argilla autotrain-advanced==0.8.8 > install_logs.txt 2>&1
+```
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/'}
+!autotrain --version
+```
+
+```{python}
+import argilla as rg
+import pandas as pd
+import re
+import os
+import random
+import torch
+from IPython.display import Image, display, HTML
+from datasets import load_dataset, Dataset, DatasetDict, ClassLabel, Sequence, Value, Features
+from transformers import pipeline, TokenClassificationPipeline
+from typing import List, Dict, Union, Tuple
+from google.colab import userdata
+```
+
+#### Step 2: Initialize the Argilla Client
+`api_url`: We can get this URL by opening `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME?embed=true`.
+
+```{python}
+client = rg.Argilla(
+    api_url="https://bikashpatra-argilla-uspto-labelling.hf.space",  # This is the URL of my public Space.
+    # api_url="https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space",
+    api_key="admin.apikey",  # Default value. Shouldn't be used in production.
+    # headers={"Authorization": f"Bearer {HF_TOKEN}"}
+)
+# Replace `YOUR_USERNAME` and `YOUR_SPACE_NAME` with your actual Hugging Face username and Space name.
+```
+
+#### Step 3: Configure the Dataset
+To configure an Argilla dataset for a token classification task, we have to do two things:
+
+1. Come up with labels specific to our problem domain. I came up with some labels by using the following prompt:
+   >suggest me some labels like "Process", "Product", "Composition of Matter" which can be used to annotate tokens in the claims or description section of patents filed in US
+
+2. 
We need to configure the fields/columns of our dataset and the [`questions`](https://docs.argilla.io/latest/how_to_guides/dataset/#questions). The `questions` parameter allows us to instruct and guide the annotator on the task. In our use case, the `labels` we created are the options annotators select from when annotating pieces (tokens) of text.
+
+```{python}
+# Labels for token classification
+labels = [
+    "Process", "Product", "Composition of Matter", "Method of Use",
+    "Software", "Hardware", "Algorithm", "System", "Device",
+    "Apparatus", "Method", "Machine", "Manufacture", "Design",
+    "Pharmaceutical Formulation", "Biotechnology", "Chemical Compound",
+    "Electrical Circuit"
+]
+
+# Dataset settings
+settings = rg.Settings(
+    guidelines="Classify individual tokens according to the specified categories, ensuring that any overlapping or nested entities are accurately captured.",
+    fields=[
+        rg.TextField(name="tokens", title="Text", use_markdown=True),
+        rg.TextField(name="document_id", title="publication_number", use_markdown=True),
+        rg.TextField(name="sentence_id", title="sentence_id", use_markdown=False)
+    ],
+    questions=[
+        rg.SpanQuestion(
+            name="span_label",
+            field="tokens",
+            labels=labels,
+            title="Classify the tokens according to the specified categories.",
+            allow_overlapping=True
+        )
+    ]
+)
+```
+
+#### Step 4: Create dataset on Argilla instance
+With the settings in place, we are ready to create our dataset using the [`rg.Dataset`](https://docs.argilla.io/latest/how_to_guides/dataset/#create-a-dataset) API.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/'}
+# We name the dataset claim_tokens
+rg_dataset = rg.Dataset(
+    name="claim_tokens",
+    settings=settings,
+)
+rg_dataset.create()
+```
+
+After step 4, we should see the dataset in the Argilla UI. We can verify this by logging in to the Argilla UI (`https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space`) with the default credentials.
+
+We can look into the settings of the dataset by clicking on the settings icon next to the dataset name.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 214}
+def display_image(filename):
+    """Render an image file inline in the notebook."""
+    display(Image(filename=filename))
+
+display_image('/content/images/argilla_ds_list_settings.png')
+```
+
+The Fields tab of the settings screen lists the fields we configured while creating the dataset with the Python SDK.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 659}
+display_image('/content/images/argilla_ds_settings.png')
+```
+
+#### Step 5: Insert records into the Argilla dataset
+
+The data preparation notebook can be found [here](https://www.kaggle.com/code/boredmgr/claim-sampling).
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 112}
+claims = pd.read_csv("/content/sample_publications.csv")
+claims.head(2)
+```
+
+Here we read the rows of the CSV and map them to the fields we created during the Argilla dataset configuration step.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 146}
+# We upload a CSV with three columns: tokens, publication_number, sequence_id
+
+publication_df = pd.read_csv("/content/sample_publications.csv")
+# Convert dataframe rows to Argilla records
+records = [
+    rg.Record(
+        fields={
+            "tokens": str(row["tokens"]),
+            "document_id": str(row["publication_number"]),
+            "sentence_id": str(row["sequence_id"]),
+        }
+    )
+    for _, row in publication_df.iterrows()
+]
+# Store the records in the Argilla dataset
+rg_dataset.records.log(records)
+```
+
+Once the records are pushed to the Argilla dataset, the UI renders the records and the labels so that the annotator can annotate the text.
+
+Check the screenshot below.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 378}
+display_image("/content/images/annotation_screen.png")
+```
+
+#### Step 6: Annotate tokens in every record with appropriate labels.
+Log in to the Argilla UI and start annotating.
+
+Argilla UI : `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME`
+
+username : `admin`
+
+password : `12345678`
+
+
+
+> After annotating the data, we have to convert the Argilla dataset to a Hugging Face dataset in order to use Hugging Face AutoTrain for fine-tuning the model. AutoTrain also allows training on CSV data, which can be uploaded from the AutoTrain UI, but for this tutorial we use a Hugging Face dataset.
+
+
+## Argilla Dataset to HuggingFace Dataset
+
+#### Step 1: Load our annotated dataset
+
+```{python}
+rg_dataset = client.datasets("claim_tokens")
+```
+
+#### Step 2: Filter the records which are annotated.
+To iterate quickly on annotation and training, we should be able to annotate a few records and train our model on just those. We can achieve this with the [query/filter](https://docs.argilla.io/latest/how_to_guides/query/) functionality of the Argilla dataset.
+
+Using the [`rg.Query()`](https://docs.argilla.io/latest/how_to_guides/query/) API, we can select the annotated records to prepare our training dataset.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/'}
+status_filter = rg.Query(filter=rg.Filter(("response.status", "==", "submitted")))
+
+submitted = rg_dataset.records(status_filter).to_list(flatten=True)
+submitted[0]
+```
+
+The annotated dataset cannot be fed as is to the model for fine-tuning. For a token classification task, the data must adhere to the structure described below.
+- Dataset Structure: The dataset should typically have two main columns:
+    - `tokens`: A list of words/tokens for each example.
+    - `ner_tags`: A list of corresponding labels for each token. The labels must follow the [IOB labelling scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).
+- Label Encoding: The labels should be integers, with each integer corresponding to a specific named entity tag.
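
To make this target structure concrete, here is a small self-contained sketch, separate from the tutorial's pipeline, that turns one sentence and two character spans into `tokens`, IOB tags, and integer `ner_tags`. The sentence, the span offsets, the label names `Product` and `Composition`, and the helper name `iob_tags_for` are all made up for illustration, and sorting the tag names is only one possible way to get a stable tag-to-id mapping.

```python
from typing import Dict, List, Union

Span = Dict[str, Union[int, str]]


def iob_tags_for(text: str, spans: List[Span]) -> List[str]:
    """Assign an IOB tag to every space-separated token of `text`."""
    tags = []
    position = 0  # character offset of the current token
    for token in text.split(" "):
        start, end = position, position + len(token)
        tag = "O"
        for span in spans:
            # A token belongs to a span when it lies fully inside it.
            if start >= span["start"] and end <= span["end"]:
                prefix = "B" if start == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tags.append(tag)
        position = end + 1  # skip the single space after the token
    return tags


# Hypothetical example: offsets are character positions in `text`.
text = "A shampoo comprising an anionic detergent"
spans = [
    {"start": 2, "end": 9, "label": "Product"},        # "shampoo"
    {"start": 24, "end": 41, "label": "Composition"},  # "anionic detergent"
]

tokens = text.split(" ")
iob_tags = iob_tags_for(text, spans)

# Label encoding: sort the tag names so the integer ids are reproducible.
label_names = sorted(set(iob_tags))
label2id = {name: idx for idx, name in enumerate(label_names)}
ner_tags = [label2id[tag] for tag in iob_tags]

print(list(zip(tokens, iob_tags)))
print(label_names)
print(ner_tags)
```

In this sketch `shampoo` becomes `B-Product`, `anionic detergent` becomes `B-Composition` followed by `I-Composition`, and every other token is tagged `O`.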
+The functions below convert our Argilla dataset to the required dataset structure.
+
+```{python}
+def get_iob_tag_for_token(token_start: int, token_end: int, ner_spans: List[Dict[str, Union[int, str]]]) -> str:
+    """
+    Determine the IOB tag for a given token based on its position within NER spans.
+
+    Args:
+        token_start (int): The start index of the token in the text.
+        token_end (int): The end index of the token in the text.
+        ner_spans (List[Dict[str, Union[int, str]]]): A list of dictionaries containing NER span information.
+            Each dictionary should have 'start', 'end', and 'label' keys.
+
+    Returns:
+        str: The IOB tag for the token. 'B-' prefix for the beginning of an entity,
+            'I-' for inside an entity, or 'O' for outside any entity.
+    """
+    for span in ner_spans:
+        if token_start >= span["start"] and token_end <= span["end"]:
+            if token_start == span["start"]:
+                return f"B-{span['label']}"
+            else:
+                return f"I-{span['label']}"
+    return "O"
+
+
+def extract_ner_tags(text: str, responses: List[Dict[str, Union[int, str]]]) -> List[str]:
+    """
+    Extract NER tags for tokens in the given text based on the provided NER responses.
+
+    Args:
+        text (str): The input text to be tokenized and tagged.
+        responses (List[Dict[str, Union[int, str]]]): A list of dictionaries containing NER span information.
+            Each dictionary should have 'start', 'end', and 'label' keys.
+
+    Returns:
+        List[str]: A list of IOB tags corresponding to each non-whitespace token in the text.
+    """
+    tokens = re.split(r"(\s+)", text)
+    ner_tags = []
+    current_position = 0
+    for token in tokens:
+        if token.strip():
+            token_start = current_position
+            token_end = current_position + len(token)
+            tag = get_iob_tag_for_token(token_start, token_end, responses)
+            ner_tags.append(tag)
+        current_position += len(token)
+    return ner_tags
+```
+
+#### Step 3: Get tokens and their respective annotations
+
+```{python}
+def get_tokens_ner_tags(annotated_rows) -> Tuple[List[List[str]], List[List[str]]]:
+    """
+    Extract tokens and their corresponding NER tags from annotated rows.
+
+    This function processes a list of annotated rows, where each row contains
+    tokens and span labels. It splits the tokens and extracts NER tags for each token.
+
+    Args:
+        annotated_rows (List[Dict[str, Union[str, List[Dict[str, Union[int, str]]]]]]):
+            A list of dictionaries, where each dictionary represents an annotated row.
+            Each row should have a 'tokens' key (str) and a 'span_label.responses' key
+            (List[Dict[str, Union[int, str]]]).
+
+    Returns:
+        Tuple[List[List[str]], List[List[str]]]: A tuple containing two elements:
+            1. A list of token lists, where each inner list represents tokens for a row.
+            2. A list of NER tag lists, where each inner list represents NER tags for a row.
+    """
+    tokens = []
+    ner_tags = []
+    for row in annotated_rows:
+        tags = extract_ner_tags(row["tokens"], row["span_label.responses"][0])
+        tks = row["tokens"].split()
+        tokens.append(tks)
+        ner_tags.append(tags)
+    return tokens, ner_tags
+train_tokens, train_ner_tags = get_tokens_ner_tags(submitted[:1])
+validation_tokens, validation_ner_tags = get_tokens_ner_tags(submitted[1:2])
+```
+
+##### Vibe Check
+It's always good to check our data after a few operations. This helps us understand and debug whether every step produces the desired output.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 409}
+display(HTML("Sample Train Tokens: " +
+             f"{train_tokens[0]}"))
+display(HTML("Sample Valid Tokens: " +
+             f"{validation_tokens[0]}"))
+display(HTML("Sample Train tags: " +
+             f"{train_ner_tags[0]}"))
+display(HTML("Sample Valid tags: " +
+             f"{validation_ner_tags[0]}"))
+```
+
+Because the goal here is to get the data creation and model training pipeline working end to end, I am, for simplicity, using one record each for training and validation.
+
+#### Step 4: Map labels (tags) to integers
+
+```{python}
+def mapped_ner_tags(ner_tags: List[List[str]]) -> List[List[int]]:
+    """
+    Convert a list of NER tags to their corresponding integer IDs.
+    This function takes a list of lists containing string NER tags, creates a unique mapping
+    of these tags to integer IDs, and then converts all tags to their respective IDs.
+    Args:
+        ner_tags (List[List[str]]): A list of lists, where each inner list contains string NER tags.
+    Returns:
+        List[List[int]]: A list of lists, where each inner list contains integer IDs
+            corresponding to the input NER tags.
+    Example:
+        >>> ner_tags = [['O', 'B-PER', 'I-PER'], ['O', 'B-ORG']]
+        >>> mapped_ner_tags(ner_tags)
+        [[3, 1, 2], [3, 0]]
+    Note:
+        The mapping of tags to IDs is created from the unique tags present in the input,
+        taken in sorted order, so it changes if the set of input tags changes.
+    """
+    labels = sorted({item for sublist in ner_tags for item in sublist})
+    label2id = {label: id_ for id_, label in enumerate(labels)}
+    return [[label2id[label] for label in ner_tag] for ner_tag in ner_tags]
+```
+
+```{python}
+def get_labels(ner_tags: List[List[str]]) -> List[str]:
+    """
+    Extract unique labels from a list of NER tag sequences.
+    This function takes a list of lists containing NER tags and returns a list of unique labels
+    found across all sequences.
+
+    Args:
+        ner_tags (List[List[str]]): A list of lists, where each inner list contains string NER tags.
+    Returns:
+        List[str]: A list of unique NER labels found in the input sequences.
+    Example:
+        >>> ner_tags = [['O', 'B-PER', 'I-PER'], ['O', 'B-ORG', 'I-ORG'], ['O', 'B-PER']]
+        >>> get_labels(ner_tags)
+        ['B-ORG', 'B-PER', 'I-ORG', 'I-PER', 'O']
+    Note:
+        The labels are returned in sorted order, so the output is the same
+        on every call for the same input.
+    """
+    return sorted({item for sublist in ner_tags for item in sublist})
+```
+
+#### Step 5: Argilla Dataset to HuggingFace Dataset
+We now have our data in the structure required for a token classification dataset. All that remains is to create a Hugging Face Dataset.
+
+```{python}
+train_labels = get_labels(train_ner_tags)
+validation_labels = get_labels(validation_ner_tags)
+# Use one shared, sorted label list for both splits so that the integer ids
+# stored in `ner_tags` line up with the ClassLabel names.
+labels = sorted(set(train_labels + validation_labels))
+label2id = {label: idx for idx, label in enumerate(labels)}
+features = Features({
+    "tokens": Sequence(Value("string")),
+    "ner_tags": Sequence(ClassLabel(num_classes=len(labels), names=labels))
+})
+train_records = [
+    {
+        "tokens": token,
+        "ner_tags": [label2id[tag] for tag in ner_tag],
+    }
+    for token, ner_tag in zip(train_tokens, train_ner_tags)
+]
+validation_records = [
+    {
+        "tokens": token,
+        "ner_tags": [label2id[tag] for tag in ner_tag],
+    }
+    for token, ner_tag in zip(validation_tokens, validation_ner_tags)
+]
+span_dataset = DatasetDict(
+    {
+        "train": Dataset.from_list(train_records, features=features),
+        "validation": Dataset.from_list(validation_records, features=features),
+    }
+)
+```
+
+```{python}
+# Assertion to verify that the train split conforms to the dataset structure required for fine-tuning.
+assert span_dataset['train'].features['ner_tags'].feature.names is not None +``` + +#### Step 6: Push dataset to Hugginface Hub + +```{python} +#| colab: {base_uri: 'https://localhost:8080/'} +!huggingface-cli login +``` + +```{python} +#| colab: {base_uri: 'https://localhost:8080/', height: 202, referenced_widgets: [e4f9cff7519b401a9db04f247b7e473c, 9b4e205349554a6fa38b50f19645181a, 9d85ae1441804dfba44528e1bc543a7d, 17c50ff23aa54b7ebb142c5df5615f78, f34a9eb89d804712bf6ee63dbc685017, 764dfd183d054eaa9ce458041be856ef, 9e7ada0da57c4f7a9994ae22e05407d5, d7aa2c0be22b41c2841047336bef2480, ebfe6ea4ec834638a0246cd40cda9ea4, d03837e77b97463687a54e7b81feb680, 70e15d43fccf47399525cbc6578618a2, 13636a22e24f4b1287bb4234f15f6a32, 3d6486dae69f4d50b471617044353d20, 2fb53aa82ac34a1fadac2a0a97840a15, 748b3a0616b544969b6ddaf8cd70092d, 945498057b214ee1a16de0391ebc87ab, 16343bcab2124dd28cfc5fccfb89f5a6, 2fdc2f2670824547a27ad9e298e10534, c47bc520c9684382a2c0e484a0f887ff, eafea0fa0b864434a723ec6391ef2876, 13e35558d8f0416fb99629ef3484bbab, 899f920fd1ea4ee2b230f89a337304ec, eec3edc793a84abeb1af22221b24b3c8, a7dfe44a19de4ffca626c2d610759297, c96f3210638442c195f0511b140793ba, c09bd90e783f4f5c88c2d57191fbdc19, 531daeb371ad4b51b1d493053d8b1e52, a1951348ae654b4583611c5e0425d81b, 34a5f08bf5f849c4829f3b83fd27bbec, 4a7301446bba44b798a70a27260e0d56, ccd15d0d08c743b08ac1419d25199a42, 11b6ad94ce8d419fb55a3817997037af, 4617899577864785a7a85be21c36fda4, 7ec8f069212f455589812568d35f7cde, b1266aa9efbd400b877df36a942f4337, b4daac723071481da2f3d7a5e1517b5f, c641704c716a40fcb3d54720cb4ff998, 4a39f8caa9214894a2a9e32cf0087105, 5e6fea490d71483b9fabf0d5f84f1934, e02a23c6f06f42639b82c65ce72246c2, 71449aa1769643c8b57094d35bc8b35f, 1303f70db20f45a39195b9872c8482ab, b01ba70e26da4b83b096c1b23da688d4, 35b2a6e0f8c543c4995e72d1543c39aa]} +span_dataset.push_to_hub("bikashpatra/sample_claims_annotated_hf") +``` + +## Model Fine-tuning using AutoTrain +Huggingface [AutoTrain](https://huggingface.co/autotrain) is a 
simple tool for training models without writing any code. We can use AutoTrain to fine-tune models for a range of tasks such as token classification, text generation, image classification and many more. To use AutoTrain, we first have to create an AutoTrain instance in a Hugging Face Space. Use the [create space](https://huggingface.co/new-space?template=autotrain-projects%2Fautotrain-advanced) link. For the Space SDK, choose Docker and select AutoTrain as the Docker template. We also need to choose the hardware to train our model on. Check the screenshots for a quick reference.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 1000}
+display_image("/content/images/autotrain_screen1.png")
+display_image("/content/images/autotrain_screen2.png")
+```
+
+### Using AutoTrain UI
+
+After the Space is created, the AutoTrain UI lets us select from a range of tasks. We have to configure our trainer in the AutoTrain UI.
+1. We select Token classification as our task.
+2. For our tutorial we fine-tune `google-bert/bert-base-uncased`. We can choose any model from the list.
+3. For DataSource, select `Hugging Face Hub`, which gives us a text box for the dataset we want to use for fine-tuning. We use the dataset we pushed to the Hugging Face Hub; I am using the dataset that I pushed, `bikashpatra/claims_annotated_hf`.
+4. Enter the keys for the `train` and `validation` splits.
+5. Under Column Mapping, enter the columns which store the tokens and tags. In my dataset, tokens are stored in the `tokens` column and labels are stored in the `ner_tags` column.
+With the above 5 inputs, we can click `Start Training` and AutoTrain takes care of fine-tuning the base model on our dataset.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 625}
+display_image("/content/images/autotrain_ui.png")
+```
+
+### Using AutoTrain CLI
+
+```{python}
+# For this cell to work, you have to store HF_TOKEN as a secret in the Colab notebook.
+os.environ['TOKEN'] = userdata.get('HF_TOKEN')
+```
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/'}
+!autotrain token-classification --train \
+    --username "bikashpatra" \
+    --token $TOKEN \
+    --backend "spaces-a10g-small" \
+    --project-name "claims-token-classification" \
+    --data-path "bikashpatra/sample_claims_annotated_hf" \
+    --train-split "train" \
+    --valid-split "validation" \
+    --tokens-column "tokens" \
+    --tags-column "ner_tags" \
+    --model "distilbert-base-uncased" \
+    --lr "2e-5" \
+    --log "tensorboard" \
+    --epochs "10" \
+    --weight-decay "0.01" \
+    --warmup-ratio "0.1" \
+    --max-seq-length "256" \
+    --mixed-precision "fp16" \
+    --push-to-hub
+```
+
+AutoTrain automatically creates a Hugging Face Space for us and triggers the training job. The link to the created Space is `https://huggingface.co/spaces/$JOBID`, where `JOBID` is the value we get from the logs of the `autotrain` CLI command.
+
+If the model training runs without errors, our model is available under the name we provided to `--project-name`. In the above example it was `claims-token-classification`.
+
+## Inference
+With all the hard work done, we have a model trained on our custom dataset. We can use the trained model to predict labels for un-annotated rows.
+We use the [HF Pipelines](https://huggingface.co/docs/transformers/main_classes/pipelines) API. Pipelines are an easy-to-use abstraction for loading a model and running inference on unseen data. In the context of this tutorial, _inference on unseen text_ means predicting labels for tokens in un-annotated text.
+
+```{python}
+# Classify a sample text
+claims_text = """
+The FINFET of claim 11 , wherein the conformal gate dielectric comprises a high-κ gate dielectric selected from
+the group consisting of: hafnium oxide (HfO 2 ), lanthanum oxide (La 2 O 3 ), and combinations thereof.
+"""
+classifier = pipeline("token-classification", model="bikashpatra/claims-token-classification", device="cpu")
+preds = classifier(claims_text)
+```
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/'}
+# The labels used for fine-tuning the model.
+classifier.model.config.id2label
+```
+
+## Push predictions to Argilla Dataset
+Using the [`rg.Query`](https://docs.argilla.io/latest/how_to_guides/query/) API, we filter the un-annotated records and predict labels for their tokens.
+
+The filter `rg.Filter(("response.status","==","pending"))` creates an Argilla filter, which we pass to [`rg.Query`](https://docs.argilla.io/latest/how_to_guides/query/) to get all the records in the Argilla dataset that have not been annotated.
+
+```{python}
+# Create a filter query to get only `pending` records in the Argilla dataset.
+status_filter = rg.Query(filter=rg.Filter(("response.status", "==", "pending")))
+
+pending = rg_dataset.records(status_filter).to_list(flatten=True)
+claims = random.sample(pending, k=min(10, len(pending)))  # pick up to 10 random samples.
+
+spans = classifier(claims[0]['tokens'])
+```
+
+### Helper function to predict the spans
+
+```{python}
+def predict_spanmarker(pipe: TokenClassificationPipeline, text: str):
+    """
+    Predict span markers for the given text using the provided pipeline.
+    Args:
+        pipe (TokenClassificationPipeline): A pipeline object for token classification.
+        text (str): The input text for which span markers are to be predicted.
+    Returns:
+        List[Dict[str, Union[int, str]]]: A list of dictionaries containing the predicted span markers.
+            Each dictionary should have 'start', 'end', and 'label' keys.
+    """
+    markers = pipe(text)
+    spans = [
+        {"label": marker["entity"][2:], "start": marker["start"], "end": marker["end"]}
+        for marker in markers if marker["entity"] != "O"
+    ]
+    return spans
+```
+
+```{python}
+updated_data = [
+    {
+        "span_label": predict_spanmarker(pipe=classifier, text=sample['tokens']),
+        "id": sample["id"],
+    }
+    for sample in claims
+]
+```
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/'}
+# print a few predictions
+updated_data[0]['span_label'][:2]
+```
+
+### Insert records to Argilla Dataset.
+
+```{python}
+#| colab: {base_uri: 'https://localhost:8080/', height: 137}
+rg_dataset.records.log(records=updated_data)
+```
+
+The records we update here are stored as [`suggestions`](https://docs.argilla.io/latest/reference/argilla/records/suggestions/) and not [`responses`](https://docs.argilla.io/latest/reference/argilla/records/responses/). Responses in the context of this tutorial are created when an annotator saves an annotation. Suggestions are labels predicted by a model. Therefore, the records we updated here will have `response.status` as `pending` and not `submitted`. This lets us, or other annotators, check the predicted labels and accept or reject the model's predictions.
+
+If we want to accept the model-predicted annotations for the tokens in a text, we can save the [`suggestions`] as [`responses`]; otherwise, we have to add, remove, or edit the labels applied to the tokens.
+
+## Conclusion
+
+In this tutorial, we explored a complete workflow for data annotation and model fine-tuning. We began by setting up an [Argilla](https://argilla.io/) instance on [Hugging Face Spaces](https://huggingface.co/spaces) as the platform for data management. We then configured and created a dataset within our Argilla instance and used its interface to manually annotate a subset of records.
+
+We then exported the annotated data to a Hugging Face [dataset](https://huggingface.co/datasets), bridging the gap between annotation and model training. Next, we demonstrated transfer learning by fine-tuning a `distilbert-base-uncased` model on this curated dataset using Hugging Face's [AutoTrain](https://huggingface.co/autotrain), a tool that simplifies model training.
+
+The workflow came full circle as we applied our fine-tuned model to annotate the remaining unlabeled records in the Argilla dataset, showing how machine learning can accelerate the annotation process. This tutorial should provide a solid foundation for implementing an iterative annotation and fine-tuning pipeline that combines human expertise with machine learning.
+
+This iterative approach allows for continuous improvement, which makes it useful for a wide range of natural language processing tasks.
+ + diff --git a/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_25_0.png b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_25_0.png new file mode 100644 index 0000000000..595f17bf02 Binary files /dev/null and b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_25_0.png differ diff --git a/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_27_0.png b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_27_0.png new file mode 100644 index 0000000000..1f33f6993f Binary files /dev/null and b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_27_0.png differ diff --git a/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_34_0.png b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_34_0.png new file mode 100644 index 0000000000..0ee2e526b7 Binary files /dev/null and b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_34_0.png differ diff --git a/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_59_0.png b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_59_0.png new file mode 100644 index 0000000000..94e0c2a1c6 Binary files /dev/null and b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_59_0.png differ diff --git a/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_59_1.png b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_59_1.png new file mode 100644 index 0000000000..fa0e5b3fdd Binary files /dev/null and b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_59_1.png differ diff --git 
a/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_62_0.png b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_62_0.png new file mode 100644 index 0000000000..c3b3040808 Binary files /dev/null and b/argilla/docs/community/token_classification_tutorial_files/token_classification_tutorial_62_0.png differ