Tags: scrapinghub/mrjob

v0.5.8

upload_dirs, pre-filters

 * automatically tarball and upload directories with --dir, setup hooks (Yelp#23)
 * specify path for inter-step output with --step-output-dir (Yelp#263)
 * jobs:
   * better --help printout
   * deprecated option groups in MRJobs
   * deprecated MRJob.get_all_option_groups()
   * overriding *_pre_filter() methods in MRJob works again (Yelp#1521)
   * all step types accept jobconf (Yelp#1447)
   * quieted warning about SORT_VALUES on Hadoop 2 (Yelp#1286)
 * all runners:
   * wrap tasks that require pipes with sh_bin, not bash (Yelp#1330)
 * local runner:
   * allows non-zero exit status from pre-filters (Yelp#1524)
   * pre-filters can now handle compressed input (Yelp#1061)
 * EMR runner:
   * fetch logs from task nodes as well as core nodes (Yelp#1400)
     * use ListInstances rather than dfsadmin to get node list (Yelp#1345)
 * moved mrjob.util.bunzip2_stream() to mrjob.cat
 * moved mrjob.util.gunzip_stream() to mrjob.cat
 * deprecated:
   * mrjob.util.args_for_opt_dest_subset()
   * mrjob.util.bash_wrap()
   * mrjob.util.populate_option_groups_with_options()
   * mrjob.util.scrape_options_and_index_by_dest()
   * mrjob.util.tar_and_gz()
   * SSHFilesystem.ssh_slave_hosts()

v0.5.7

Spark

 * EMR and Hadoop runners:
   * full support for Spark (Yelp#1320)
     * includes spark() method in MRJob and SparkStep/SparkScriptStep
   * can use environment variables and ~ in hadoop_streaming_jar option
 * EMR runner:
   * default AMI version is now 4.8.2 (Yelp#1486)
   * default instance type is m1.large when running Spark jobs (Yelp#1465)
   * added debug logging for matching available pooled clusters (Yelp#1449)
   * defaults to cheapest instance type that will work (Yelp#1369)
   * master bootstrap script always created when pooling
   * no longer crashes when trying to use missing ssh binary (Yelp#1474)
   * pooled clusters may have 1000 steps (Yelp#1463)
   * failed jobs no longer reported as 100% complete (Yelp#793)
 * All runners:
   * py_files option for Spark and streaming steps (Yelp#1375)
   * bootstrap mrjob with a .zip rather than a tarball
   * options refactor, added missing command-line switches (Yelp#1439)
 * mrjob terminate-idle-clusters works with all step types (Yelp#1363)
 * log interpretation:
   * dropped unnecessary container-to-attempt-ID mapping (Yelp#1487)
   * more efficient search for task log errors (Yelp#1450)
   * cleaner error messages when bootstrapped mrjob won't compile
 * JarSteps:
   * now support libjars, jobconf (Yelp#1481)
   * JarStep.{INPUT,OUTPUT} are deprecated (use mrjob.step.{INPUT,OUTPUT})
 * is_uri() now only matches URIs containing "://" (Yelp#1455)
 * works in Anaconda3 Jupyter Notebook (Yelp#1441)
 * deprecated mrjob.parse.is_windows_path()
 * deprecated mrjob.parse.parse_key_value_list()
 * deprecated mrjob.parse.parse_port_range_list()
 * deprecated mrjob.util.scrape_options_into_new_groups()
 * deprecated non-strict protocols (Yelp#1452)
 * deprecated python_archives (Yelp#1056)
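
The is_uri() change above can be pictured with a minimal sketch. This is not mrjob's actual implementation; the name is_uri_sketch is hypothetical, and the function only encodes the rule stated in the notes (a URI must contain "://"):

```python
def is_uri_sketch(path):
    """Hypothetical stand-in for the rule described in the notes:
    treat a string as a URI only if it contains "://"."""
    return '://' in path


# A Windows drive path contains a colon but no "://",
# so under this rule it is treated as a plain path.
examples = ['s3://bucket/key', 'hdfs:///tmp/out', 'C:\\data\\in.txt', '/tmp/in.txt']
uris = [p for p in examples if is_uri_sketch(p)]
```

Here uris keeps only the s3:// and hdfs:// entries.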

v0.5.6

dataproc crash fix

 * Dataproc runner:
   * fix Hadoop version crash on unknown image version (Yelp#1428)
 * EMR and Hadoop runners:
   * prioritize task errors as probable cause of failure (Yelp#1429)
   * ignore Java stack trace in task stderr logs (Yelp#1430)

v0.5.5

missing ami_version option

 * EMR runner:
   * deprecate, don't remove ami_version option in v0.5.4 (Yelp#1421)
   * update memory/CPU stats for EC2 instances for pooling (Yelp#1414)
   * pooling treats application names as case-insensitive (Yelp#1417)

v0.5.4

pooling auto-recovery

 * jobs:
   * pass_through_option(), for existing command-line options (Yelp#1075)
     * MRJob.options.runner now defaults to None, not 'inline' or 'local'
 * runners:
   * all:
     * names of uploaded files now never start with . or _ (Yelp#1200)
   * Hadoop:
     * log parsing:
       * handles more log4j patterns (Yelp#1405)
       * gracefully handles IOError from exists() (Yelp#1355)
     * fixed crash bug in Hadoop FS on Python 3 (Yelp#1396)
   * EMR:
     * pooling auto-recovers from joining a cluster that self-terminated (Yelp#708)
     * log fetching uses sudo on 4.3.0+ AMIs (Yelp#1244)
     * fixed broken --ssh-bind-ports switch (Yelp#1402)
     * idle termination script now only runs on master node (Yelp#1398)
     * ssh tunnel connects to internal IP of resource manager (Yelp#1397)
     * AWS credentials no longer logged in verbose mode (Yelp#1353)
     * many option names are now more generic (Yelp#1247)
       * ami_version -> image_version
       * aws_availability_zone -> zone
       * aws_region -> region
       * check_emr_status_every -> check_cluster_every
       * ec2_core_instance_bid_price -> core_instance_bid_price
       * ec2_core_instance_type -> core_instance_type
       * ec2_instance_type -> instance_type
       * ec2_master_instance_bid_price -> master_instance_bid_price
       * ec2_master_instance_type -> master_instance_type
       * ec2_task_instance_bid_price -> task_instance_bid_price
       * ec2_task_instance_type -> task_instance_type
       * emr_tags -> tags
       * num_ec2_core_instances -> num_core_instances
       * num_ec2_task_instances -> num_task_instances
       * s3_log_uri -> cloud_log_dir
       * s3_sync_wait_time -> cloud_fs_sync_secs
       * s3_tmp_dir -> cloud_tmp_dir
       * s3_upload_part_size -> cloud_upload_part_size
     * num_ec2_instances is deprecated (use num_core_instances)
     * ec2_slave_instance_type is deprecated (use core_instance_type)
     * hadoop_streaming_jar_on_emr is deprecated (Yelp#1405)
       * hadoop_streaming_jar handles this instead with file:// URIs
     * bootstrap_python does nothing on AMI 4.6.0+, as not needed (Yelp#1358)
 * mrjob audit-emr-usage should show fewer or no API throttling warnings (Yelp#1091)
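
When updating an old config by hand, the renames listed above (Yelp#1247) can be applied mechanically. The sketch below is a standalone helper, not part of mrjob: RENAMED_OPTS is copied from the list above, and migrate_opts is a hypothetical name. It assumes the options are held in a flat dict and that an explicitly set new-style name should win over its old alias:

```python
# Old option name -> new option name, copied from the v0.5.4 notes.
RENAMED_OPTS = {
    'ami_version': 'image_version',
    'aws_availability_zone': 'zone',
    'aws_region': 'region',
    'check_emr_status_every': 'check_cluster_every',
    'ec2_core_instance_bid_price': 'core_instance_bid_price',
    'ec2_core_instance_type': 'core_instance_type',
    'ec2_instance_type': 'instance_type',
    'ec2_master_instance_bid_price': 'master_instance_bid_price',
    'ec2_master_instance_type': 'master_instance_type',
    'ec2_task_instance_bid_price': 'task_instance_bid_price',
    'ec2_task_instance_type': 'task_instance_type',
    'emr_tags': 'tags',
    'num_ec2_core_instances': 'num_core_instances',
    'num_ec2_task_instances': 'num_task_instances',
    's3_log_uri': 'cloud_log_dir',
    's3_sync_wait_time': 'cloud_fs_sync_secs',
    's3_tmp_dir': 'cloud_tmp_dir',
    's3_upload_part_size': 'cloud_upload_part_size',
}


def migrate_opts(opts):
    """Return a copy of opts with old names replaced by new ones.

    Keys not in RENAMED_OPTS pass through unchanged. If both an old
    name and its replacement are present, the replacement's value wins.
    """
    migrated = {}
    for name, value in opts.items():
        new_name = RENAMED_OPTS.get(name, name)
        if name != new_name and new_name in migrated:
            continue  # new-style key already set explicitly; keep it
        migrated[new_name] = value
    return migrated
```

For example, migrate_opts({'ami_version': '4.8.2', 'aws_region': 'us-west-2'}) yields a dict keyed by image_version and region. The deprecated num_ec2_instances and ec2_slave_instance_type options are left out on purpose, since the notes list them as deprecations rather than renames.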

v0.5.3

libjars

 * jobs:
   * LIBJARS and libjars method (Yelp#1341)
 * runners:
   * all:
     * .cpython-3*.pyc files no longer included when bootstrapping mrjob
   * local:
     * *PATH envvars combined with local separator (Yelp#1321)
   * Hadoop and EMR:
     * libjars option (Yelp#198)
     * fixes to ordering of generic and JAR-specific options (Yelp#1331, Yelp#1332)
   * Hadoop:
     * more default log dirs (Yelp#1339)
     * hadoop_tmp_dir handles ~ and envvars (Yelp#1322) (broken in v0.5.0)
   * EMR:
     * determine cause of failure of bootstrap scripts (Yelp#370)
       * master bootstrap script now redirects stdout to stderr
     * emr_configurations option (Yelp#1276)
     * subnet option (Yelp#1323)
     * SSH tunnel opened as soon as cluster is ready (Yelp#1115)
     * SSH tunnel leaves stdin alone (Yelp#1161)
 * combine_lists() treats dicts as values, not sequences
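
To illustrate the combine_lists() note above: a combiner that concatenates sequences needs to decide what counts as a single value. The following is a rough sketch of that idea, not mrjob's code; combine_lists_sketch is a hypothetical name, and the handling of strings and other scalars is an assumption based on the v0.4.6 note that combine_lists() can handle scalars:

```python
def combine_lists_sketch(*seqs):
    """Concatenate sequences, treating strings, bytes, dicts, and
    non-iterable scalars as single values. None is skipped."""
    result = []
    for seq in seqs:
        if seq is None:
            continue
        if isinstance(seq, (str, bytes, dict)):
            # iterating a dict would yield only its keys, so keep it whole
            result.append(seq)
            continue
        try:
            result.extend(seq)
        except TypeError:
            # not iterable (e.g. an int): treat as a single value
            result.append(seq)
    return result
```

With this sketch, combine_lists_sketch(['a'], {'b': 1}) keeps the dict intact as the second element instead of flattening it to its keys.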

v0.5.2

initial Cloud Dataproc support

 * basic support for Google Cloud Dataproc (Yelp#1243)
   * lacks log interpretation, JarStep support
 * on EMR, wait for steps to complete in correct order (Yelp#1316)
 * correctly handle ~ in include path in mrjob.conf (Yelp#1308)
 * new emr_applications option (Yelp#1293)
 * fix running deprecated tools with python -m (Yelp#1312)
 * fix ssh tunneling to 2.x AMIs on EMR in VPCs (Yelp#1311)

v0.5.1

post-release bugfixes

 * strict_protocols in mrjob.conf is no longer ignored (Yelp#1302)
 * check_input_paths in mrjob.conf is no longer ignored
 * partitioner() is no longer ignored, fixing SORT_VALUES (Yelp#1294)
   * --partitioner switch is deprecated
 * improved probable cause of error from pre-YARN logs (Yelp#1288)
 * ssh_bind_ports now defaults to (x)range, not list (Yelp#1284)
 * mrjob terminate-idle-clusters handles debugging jar from boto 2.40.0 (Yelp#1306)

v0.5.0

the future is in the past

 * supports Python 3 (Yelp#989)
 * requires boto 2.35.0 or newer (Yelp#980)
   * removed many workarounds for S3 and EMR (Yelp#980), IAM (Yelp#1062)
 * jobs:
   * is_mapper_or_reducer() is now is_task() (Yelp#1072)
   * mr() no longer takes positional arguments (Yelp#814)
   * removed jar() (use mrjob.step.JarStep)
   * removed testing methods parse_counters() and parse_output()
   * protocols:
     * protocols are strict by default (Yelp#724)
     * JSON protocols use ujson when available, then simplejson (Yelp#1002, Yelp#1266)
       * can explicitly choose Standard, Simple or Ultra JSON protocol
     * raw protocols handle bytes or unicode depending on Python version
       * can explicitly choose Text or Bytes protocol
   * mrjob.step:
     * JarStep only takes "args" and "main_class" keyword args
     * removed MRJobStep (use MRStep)
 * runners:
   * All runners:
     * totally revamped log handling (Yelp#1123)
     * runner status/log messages are less noisy (Yelp#1044)
     * don't bootstrap mrjob if interpreter is set (Yelp#1041)
     * fs methods path_exists() and path_join() are now exists() and join()
     * deprecation warning: use runner.fs explicitly (Yelp#1146)
     * changes to cleanup options:
       * removed IS_SUCCESSFUL (use ALL)
       * LOCAL_SCRATCH is now LOCAL_TMP (Yelp#318)
       * new HADOOP_TMP option handles HDFS cleanup (Yelp#1261)
       * REMOTE_SCRATCH is now CLOUD_TMP (Yelp#1261)
     * base_tmp_dir option is now local_tmp_dir (Yelp#318)
     * non-inline runners raise StepFailedException on step failure (Yelp#1219)
     * steps_python_bin defaults to current python interpreter (Yelp#1038)
     * _job_name is now _job_key (Yelp#982)
   * EMR:
     * default AWS region is us-west-2 (Yelp#1025)
     * default instance type is m1.medium (Yelp#992)
     * visible_to_all_users defaults to true (Yelp#1016)
     * matches your minor version of Python 2 on 3.x and 4.x AMIs (Yelp#1265)
     * 4.x AMIs are supported (Yelp#1105)
       * added --release-label switch (--ami-version 4.x.y also works)
     * can fetch counters and probable cause of failure on 3.x and 4.x AMIs
     * SSH tunnel now works on 3.x and 4.x AMIs (Yelp#1013)
       * ssh_tunnel_to_job_tracker option is now ssh_tunnel
     * correctly fetch step logs by step ID (Yelp#1117)
     * bootstrap_python option
     * s3_scratch_uri option is now s3_tmp_dir (Yelp#318)
     * aws_region is no longer inferred from s3_tmp_dir
     * create/select temp bucket in same region as EMR jobs (Yelp#687)
     * added iam_endpoint option (Yelp#1067)
     * removed s3_conn args from methods in EMRJobRunner and S3Filesystem
     * S3 Filesystem:
       * connect to each S3 bucket on appropriate endpoint (Yelp#1028)
         * fall back to default if we can't get bucket location (Yelp#1170)
       * removed special treatment of _$folder$ keys
         * removed deprecated S3Filesystem method get_s3_folder_keys()
       * recurse "subdirectories" even if uri lacks trailing / (Yelp#1183)
     * removed iam_job_flow_role option (use iam_instance_profile)
     * custom hadoop_streaming_jar gets properly uploaded
     * job cleanup temporarily disabled (Yelp#1241)
     * pooling respects key pair (Yelp#1230)
     * idle cluster self-termination respects non-streaming jobs (Yelp#1145)
     * deprecated "latest" AMI version not passed through to EMR (Yelp#1269)
     * emr_job_flow_id option is now cluster_id (Yelp#1082)
     * emr_job_flow_pool_name is now pool_name (Yelp#1082)
     * pool_emr_job_flows is now pool_clusters (Yelp#1082)
   * Hadoop:
     * works out of the box on most Hadoop setups (Yelp#1160)
     * works out of the box inside EMR (2.x, 3.x, and 4.x AMIs)
     * counters are parsed from Hadoop binary stderr in YARN (Yelp#1153)
     * can find logs and probable cause of failure in YARN (Yelp#1195)
       * will search in <output dir>/_logs, to support Cloudera (Yelp#565)
     * HDFS Filesystem:
       * use fs -ls -R and fs -rm -R in YARN (Yelp#1152)
       * mkdir() now uses -p on YARN (Yelp#991)
       * fs.du() now works on YARN (Yelp#1155)
       * fs.du() now returns 0 for nonexistent files instead of erroring
       * fs.rm() now uses -skipTrash
     * dropped support for Hadoop prior to 0.20.203 (Yelp#1208)
     * added hadoop_log_dirs option
     * hdfs_scratch_dir option is now hadoop_tmp_dir (Yelp#318)
     * hadoop_home is deprecated
     * uses -D and correct property name when step has no reduces (Yelp#1213)
   * Inline/Local:
     * runner.fs raises IOError if passed URIs (Yelp#1185)
     * version-agnostic by default (Yelp#735)
     * removed ignored hadoop_extra_args and hadoop_streaming_jar opts (Yelp#1275)
     * inline runner uses multiple splits by default (Yelp#1276)
 * removed mrjob.compat.get_jobconf_value() (use jobconf_from_env())
 * removed mrjob.compat methods to support Hadoop prior to 0.20.203:
   * supports_combiners_in_hadoop_streaming()
   * supports_new_distributed_cache_options()
   * uses_generic_jobconf()
 * removed mrjob.conf.combine_cmd_lists()
 * removed fetch-logs tool (Yelp#1127)
 * mrjob subcommands use "cluster" rather than "job-flow" (Yelp#1082)
   * create-job-flow is now create-cluster
   * terminate-idle-job-flows is now terminate-idle-clusters
   * terminate-job-flow is now terminate-cluster
 * Python-version-specific mrjob-x and mrjob-x.y commands (Yelp#1104)
 * use followlinks=True with os.walk()
 * all internal constants/functions/methods explicitly start with _ (Yelp#681)
 * mrjob.util:
   * file_ext() takes filename, not path
   * random_identifier() moved here from mrjob.aws
   * buffer_iterator_to_line_iterator() is now to_lines()
     * to_lines() no longer appends a newline to data (Yelp#819)
   * removed extract_dir_for_tar()
   * gunzip_stream() now yields chunks, not lines
   * removed hash_object()
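
The to_lines() notes above describe turning a stream of arbitrary byte chunks into lines without adding a newline that was not in the data. A minimal sketch of that behavior follows; it is an illustration under those stated assumptions, not the mrjob.util source, and to_lines_sketch is a hypothetical name:

```python
def to_lines_sketch(chunks):
    """Yield lines from an iterable of byte chunks.

    Each yielded line keeps its trailing newline if it had one.
    A final partial line is yielded as-is, with no newline appended.
    """
    leftovers = b''
    for chunk in chunks:
        data = leftovers + chunk
        pieces = data.split(b'\n')
        # the last piece has no newline yet; hold it for the next chunk
        leftovers = pieces.pop()
        for piece in pieces:
            yield piece + b'\n'
    if leftovers:
        yield leftovers


lines = list(to_lines_sketch([b'a\nb', b'c\n', b'd']))
```

Here lines is [b'a\n', b'bc\n', b'd']: the line split across two chunks is rejoined, and the last line stays without a trailing newline.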

v0.4.6

config files

 * PyYAML>=3.08 is required
 * !clear tag in conf files (Yelp#1162)
 * combine_lists() and combine_path_lists() can handle scalars (Yelp#1172)
 * include: paths in conf files are relative to real path of conf file (Yelp#1166)
 * mrjob.conf.combine_cmd_lists() is deprecated (Yelp#1168)
 * EMR runner: pool_wait_minutes can now be loaded from mrjob.conf (Yelp#1070)