Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Baseline IRs for translating ADF pipelines to workflows #1235

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ghanse
Copy link

@ghanse ghanse commented Nov 25, 2024

Summary

This PR adds intermediate representations of ADF pipelines, activities, and linked services from the APIs. This is a required baseline for translating ADF pipelines to Databricks workflows.

Details

  • Added the orchestrators.adf package
  • Created PipelineNode for any nodes coming from ADF
  • Created classes to represent the ADF API objects (e.g. Pipelines, Activities, and Linked Services)

Notes

  • This is a larger PR than I would like, but minimally implements the IRs needed to translate 1 pipeline
  • Can you validate my pattern for implementing the children method? (e.g. in DatabricksNotebookActivity)

@ghanse ghanse requested a review from a team as a code owner November 25, 2024 19:16
Copy link
Contributor

@vil1 vil1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be refactored as to avoid Option fields as much as possible.

Comment on lines 7 to 16
dependsOn: Seq[ActivityDependency],
description: Option[String],
linkedServiceName: Option[LinkedServiceReference],
name: Option[String],
onInactiveMarkAs: Option[OnInactiveMarkAs],
policy: Option[ActivityPolicy],
state: Option[ActivityState],
activityType: Option[String],
activityProperties: Option[ActivityProperties],
userProperties: Seq[UserProperty])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case classes like this one where all fields are Seq or Option almost always indicate a design problem. As a rule of thumb, one should ask themselves "what would be the meaning of Activity(Seq(), None, None, None, None, None, None, None, None, Seq())?".

Option in a case class field should only be used when there is a meaningful situation where the field could be absent.

For example, is it possible for an Activity to have no state? Note that isn't exactly the same question as "can an Activity be neither Active nor Inactive?". If the answer to the latter question is "yes", then we should add a case object Unknown extends Activity for example and define the field as state: ActivityState = Unknown.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the name, activityType, activityProperties, and state fields required. Every activity should have at minimum a name, type, set of type-specific properties, and state.

Many of the other fields are configurations which will be optionally returned by the API client.

Comment on lines 5 to 11
case class LibraryDefinition(
jarFilePath: Option[String],
eggFilePath: Option[String],
whlFilePath: Option[String],
mavenSpecification: Option[MavenSpecification],
pyPiSpecification: Option[PyPiSpecification],
cranSpecification: Option[CranSpecification])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the comment above, here it is quite clear that we have three mutually exclusive cases mashed together.
We should rather define it as

sealed trait LibraryDefinition
case class MavenLibrary(jarFilePath: String, mavenSpecification: MavenSpecification) extends LibraryDefinition
case class PyPiLibrary(eggFilePath: String, piPySpecification: PyPiSpecification) extends LibraryDefinition
case class CranLibrary(whlFilePath: String, cranSpecification: CranSpecification) extends LibraryDefinition

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done.

Comment on lines 5 to 7
abstract class ActivityProperties(name: Option[String]) extends PipelineNode {
override def children: Seq[PipelineNode] = Seq()
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
abstract class ActivityProperties(name: Option[String]) extends PipelineNode {
override def children: Seq[PipelineNode] = Seq()
}
trait ActivityProperties extends PipelineNode {
def name: String
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done. I will note that there are no common fields across ActivityProperties of different types. Each activity type will have a unique set of properties. Is it OK to use a trait in this way?

Comment on lines 5 to 12
case class DatabricksNotebookActivity(
name: Option[String],
baseParameters: Map[String, String],
libraries: Seq[LibraryDefinition],
notebookPath: Option[String])
extends ActivityProperties(name) {
override def children: Seq[PipelineNode] = super.children ++ libraries
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case class DatabricksNotebookActivity(
name: Option[String],
baseParameters: Map[String, String],
libraries: Seq[LibraryDefinition],
notebookPath: Option[String])
extends ActivityProperties(name) {
override def children: Seq[PipelineNode] = super.children ++ libraries
}
case class DatabricksNotebookActivity(
override val name: String,
baseParameters: Map[String, String],
libraries: Seq[LibraryDefinition],
notebookPath: Option[String])
extends ActivityProperties {
override def children: Seq[PipelineNode] = libraries
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done.

Comment on lines 18 to 20
override def children: Seq[PipelineNode] = Seq() ++ dependsOn ++ linkedServiceName ++
policy ++ userProperties
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are sure that linkedServiceName, policy and userProperties are optional

Suggested change
override def children: Seq[PipelineNode] = Seq() ++ dependsOn ++ linkedServiceName ++
policy ++ userProperties
}
override def children: Seq[PipelineNode] = dependsOn ++ linkedServiceName.toSeq ++
policy.toSeq ++ userProperties.toSeq
}

otherwise, we should go for

override def children: Seq[PipelineNode] = dependsOn ++ Seq(linkedServiceName, policy, userProperties)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done. I made fields required and used the last pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants