| # | KNIME Node | Module | Notes |
|---|---|---|---|
1 | Color Manager | color_manager.py | KNIME color annotations are UI metadata and have no native representation in pandas; we therefore forward the input table unchanged to all outputs. |
2 | Column Appender | column_appender.py | - Settings read: selected_rowid_mode, selected_rowid_table, selected_rowid_table_number (base suffix defaults to "_r"; final suffix becomes f"{base}{k}" per right table). - Alignment: IDENTICAL → index join; other modes → positional concat with reset index. |
3 | Column Filter (exclude-only) | column_filter.py | - Excludes are parsed heuristically from settings.xml by scanning `<config>` blocks whose keys contain "exclude", collecting list entries (`<entry key='0' value='Col'/>` or `<entry key='name'/>`). - Dropping uses errors='ignore' so missing columns won't fail the cell. - If no excludes are found, the node is a passthrough. |
4 | Column Renamer | column_renamer.py | • Supports only explicit (old → new) mappings from settings.xml. • No pattern/regex templating, no type-based renames, no column reordering. |
5 | Concatenate | concatenate.py | - No suffixing or renaming of columns. - No column intersection logic; pandas default union alignment is used. - Row index is reset (0..N-1) via ignore_index=True. |
6 | CSV Reader | csv_reader.py | pandas>=1.5 recommended (nullable dtypes supported in the dtype mapping). Quote/escape characters are passed to pandas. If escapechar equals quotechar, we omit escapechar and rely on double-quote parsing (avoids C-engine "EOF inside string" errors). The dtype mapping is derived from table_spec_config_Internals; unknown types are left to inference. Path resolution supports LOCAL and RELATIVE knime.workflow only; other FS types are not yet handled. Robust NA/dtype handling (see the coercion sketch below the table): - Treat '' and ' ' as missing on read (na_values=['', ' '], keep_default_na=True, skipinitialspace=True) - Read WITHOUT dtype=..., then coerce per-column: * numeric targets ('Int64', 'Float64') via pd.to_numeric(..., errors='coerce').astype(target) * other types via .astype(target) |
7 | CSV Writer | csv_writer.py | pandas>=1.5 recommended for consistent NA/nullable dtype handling. Path resolution supports LOCAL absolute paths and RELATIVE knime.workflow; other FS types are not yet handled. Directory creation is not automatic; ensure out_path.parent exists before writing. Line terminator / quoting mode / doublequote / escapechar are not explicitly mapped unless present; pandas defaults apply. File is overwritten by default; KNIME “append/overwrite” style flags are not implemented here. |
8 | Decision Tree Learner | decision_tree_learner.py | Pruning options (e.g., pruningMethod/Reduced Error Pruning) are not available in sklearn DT; consider ccp_alpha for cost-complexity pruning if needed. First-split constraints and binary nominal split settings are not supported by sklearn. Feature importances are impurity-based (Gini/entropy); consider permutation importances if you need model-agnostic measures. Library expectations: pandas>=1.5, numpy>=1.23, scikit-learn>=1.2 recommended. |
9 | Decision Tree Predictor | decision_tree_predictor.py | The estimator itself must be scikit-learn-like. Scope: classification predictor only; multi-output and regression variants are not handled. |
10 | Equal Size Sampling | equal_size_sampling.py | Exact mode only: “Approximate” sampling is not implemented in this generator. Requires pandas; scikit-learn is used only for resample() (no synthetic example generation). Seed is used when provided; default fallback is 1 for deterministic output. Order of rows after concatenation is re-sorted back to the original index. |
11 | Excel Reader | excel_reader.py | Covered mappings (KNIME → pandas): • Path: LOCAL & RELATIVE (knime.workflow) via resolve_reader_path() • Sheet selection: sheet_selection ∈ {FIRST, NAME, INDEX} → sheet (0 | 'name' | index) • Header: table_contains_column_names + column_names_row_number → header (0-based) or None • Column range: read_from_column/read_to_column → usecols="A:D" (Excel A1-style column span) • Row range: read_from_row/read_to_row → skiprows / nrows (best-effort) • Dtypes: table_spec_config_Internals → dtype mapping (nullable pandas dtypes when possible) • Replace empty strings with missings: advanced_settings.replace_empty_strings_with_missings |
12 | Excel Writer | excel_writer.py | - Only XLSX is supported (engine='openpyxl'). Legacy XLS (xls) is not implemented. - KNIME-style row-wise append into an existing sheet is not fully replicated. Pandas does not support true “append to bottom” without custom openpyxl manipulation. - We honor if_sheet_exists and header flags but do not append rows. - Auto-size columns, print layout, formula evaluation, and “open file after exec” are not supported. |
13 | Gradient Boosted Trees (Classification) Learner | gbt_learner.py | - Feature selection: use included_names if present; otherwise all numeric/boolean columns except the target. Columns in excluded_names are removed afterward. If no target is configured, the node is a passthrough: bundle=None and empty outputs with an error note in the summary. - Hyperparameters mapped (see the mapping sketch below the table): nrModels→n_estimators, learningRate→learning_rate, maxLevels (-1/absent → sklearn default 3)→max_depth, minNodeSize→min_samples_split (≥2), minChildSize→min_samples_leaf (≥1), dataFraction (0 < fraction ≤ 1)→subsample (stochastic GB), columnSamplingMode→max_features (None/'sqrt'/'log2'/fraction/int), seed→random_state. Seed defaults to 1 for deterministic output. - Unsupported/orthogonal flags: splitCriterion (trees in sklearn GBT have a fixed criterion), missingValueHandling (impute beforehand), useAverageSplitPoints, useBinaryNominalSplits, isUseDifferentAttributesAtEachNode (no direct sklearn analog). These are noted and ignored. - Outputs: port 1=model bundle (estimator, metadata), port 2=feature_importances_, port 3=summary. - Dependencies: lxml for XML parsing; pandas/numpy for data handling; scikit-learn for modeling. - KNIME seeds can exceed 2**32-1, so we coerce to a valid sklearn seed: seed32 = None if seed is None else int(abs(int(seed)) % (2**32)) |
14 | Gradient Boosted Trees (Classification) Predictor | gbt_predictor.py | - Bundle keys (if present): {'estimator','features','target','classes',...}; falls back gracefully to a bare estimator and infers features if needed (raises KeyError if required columns are missing). - Prediction column name: custom if configured, else "Prediction (<target>)". - Probabilities: adds per-class "P (<target>=<class>)<suffix>" when predict_proba is available; may also append "<prediction> (confidence)" as the max probability (see the predictor sketch below the table). - Optional: append the number of boosted estimators as "<prediction> (models)". - Ignored flag: 'useSoftVoting' (not applicable to sklearn GBT). |
15 | K Nearest Neighbor (single-node trainer + scorer) | knn.py | - Inputs: one table with a target column (classColumn) plus feature columns. - Feature selection: all numeric/boolean columns except the target. Values are coerced to numeric (invalid → NaN) and filled with 0.0 to satisfy KNN distance computations. - Hyperparameters: k (neighbors), weightByDistance → weights ('uniform'|'distance'). |
16 | Linear Correlation | linear_corellation.py | Settings honored (from settings.xml): - include-list: included_names / excluded_names + enforce_option (EnforceInclusion/EnforceExclusion) - pvalAlternative: TWO_SIDED | GREATER | LESS (re-scales p from two-sided if SciPy is available) - columnPairsFilter: COMPATIBLE_PAIRS | ALL_PAIRS (we compute numeric↔numeric only) |
17 | Logistic Regression Learner | logreg_learner.py | - Feature selection: use included_names if set; otherwise all numeric/boolean columns minus the target; then remove excluded_names. - Hyperparameter mapping: solver (KNIME→sklearn), maxEpoch→max_iter, epsilon→tol, seed→random_state. Target reference category is recorded as metadata only (no sklearn equivalent). |
18 | Logistic Regression Predictor | logreg_predictor.py | - Bundle keys (if present): {'estimator','features','target','classes',...}; falls back to a bare estimator and infers features if absent (raises KeyError if required columns are missing). - Prediction column name: custom if configured; otherwise "Prediction (<target>)". - Probabilities: when predict_proba exists, adds "P (<target>=<class>)<suffix>" columns. - XML quirks: reads KNIME’s misspelled keys verbatim (has_custom_predicition_name, include_probabilites, propability_columns_suffix). |
19 | Math Formula (JEP) | math_formula.py | - Only a small set of functions is mapped (ln, log, log10, sqrt, exp, round, ceil, floor). - Advanced JEP functions/operators not listed above are not translated. |
20 | Missing Value Handler | missing_value.py | - Integers: mean/median/mode fills are rounded and per-column recast to nullable Int64. - Skip branches always contain an executable statement (pass) to avoid IndentationError. - We never emit .fillna(None). |
21 | MLP Predictor | mlp_predictor.py | - Settings: "change prediction" (bool), "prediction column name" (string), "append probabilities" (bool), "class probability suffix" (e.g., "_AN"). |
22 | Naive Bayes Learner (GaussianNB with optional one-hot on categoricals) | naive_bayes_learner.py | Settings parsed (best effort): - classifyColumn (target) - threshold → var_smoothing for GaussianNB - minSdValue, minSdThreshold (not directly supported in sklearn; documented in meta) - maxNoOfNomVals (categorical columns with > N unique values are ignored) - skipMissingVals (True → drop rows with any missing among selected features; False → impute numeric with mean and keep dummy_na for categoricals) |
23 | Naive Bayes Predictor | naive_bayes_predictor.py | - Inputs: Port 1 → model bundle (dict with {'estimator','features','target','classes','meta': {...}}) or a bare sklearn estimator as fallback Port 2 → data table to score - Prediction column: • If settings["change prediction"] is True and a custom name is provided, uses it. • Otherwise defaults to "Prediction (<target>)". - Probabilities: • If enabled, adds "P (<target>=<class>)<suffix>" per class (suffix from settings, default "_NB"). - Feature matrix reconstruction: • Prefer bundle['features'] (order preserved). • Else use getattr(estimator, 'feature_names_in_', None). • Else build from data: numeric columns + one-hot for all non-numeric; then align: - add missing expected columns with 0, - drop extra columns not in features, using bundle['meta'] flags when available (e.g., skip_missing) for dummy_na policy. |
24 | Normalizer | normalizer.py | - Column selection: use included_names if set; else all numeric dtypes (Int*/int*/Float*/float*); drop excluded_names afterward. - Modes (see the normalization sketch below the table): MINMAX rescales to [new-min, new-max] (constant/empty columns map to new_min); ZSCORE uses (x - mean) / std (zero std → 0.0). |
25 | One to Many (One-Hot Encoding) | one_to_many.py | - Column selection: parsed from model/columns2Btransformed with EnforceInclusion/EnforceExclusion semantics, restricted to string-like dtypes (string/object/category). - Naming: new columns are prefixed with the source column and an '=' separator (e.g., "Region=West") to avoid collisions when different columns share the same category label (see the get_dummies sketch below the table). - Missing values: not encoded (rows with NA get all zeros for that column’s dummies). - removeSources: if true, drops the original columns after expansion. |
26 | Partitioning | partitioning.py | - Implementation: sklearn.model_selection.train_test_split; seed honored when provided. - STRATIFIED: uses class_column; NaN is treated as a separate class; falls back to a non-stratified split if stratification is infeasible (e.g., tiny classes; see the split sketch below the table). - RELATIVE: fraction is clamped to [0,1]. ABSOLUTE: train_size is an integer bounded by len(df). |
27 | Random Forest (Classification) Learner | random_forest_learner.py | - Feature selection: use included_names if provided; otherwise all numeric/boolean columns except the target; excluded_names are removed afterward. - Hyperparameter mapping: nrModels→n_estimators; maxLevels>0→max_depth else None; minNodeSize→min_samples_split; minChildSize→min_samples_leaf; isDataSelectionWithReplacement→bootstrap; dataFraction→max_samples (only when bootstrap=True); columnSamplingMode/columnFractionPerTree/columnAbsolutePerTree plus isUseDifferentAttributesAtEachNode→max_features ('sqrt'/'log2'/1.0/fraction/int); seed→random_state. - Info-only flags (not applied in sklearn RF): splitCriterion, missingValueHandling, useAverageSplitPoints, useBinaryNominalSplits; noted and ignored. |
28 | Random Forest (Classification) Predictor | random_forest_predictor.py | - Ports: In1=model bundle, In2=data table, Out1=predicted table. - Bundle keys (if present): {'estimator','features','target','classes',...}; falls back to a bare estimator and infers features if absent (raises KeyError if required columns are missing). - Prediction column name: custom if configured; otherwise "Prediction (<target>)". - Probabilities: when available, adds "P (<target>=<class>)<suffix>"; may also append "<prediction> (confidence)" as max probability. Optional "Model Count" from n_estimators. - 'useSoftVoting' is informational; sklearn RandomForest averages probabilities by design. |
29 | Reference Row Splitter | reference_row_splitter.py | • Join keys are coerced to pandas 'string' dtype; NaNs in the reference key set are ignored. • If a configured column is missing, a clear KeyError is raised. |
30 | ROC Curve | roc_curve.py | Supports both KNIME view configuration variants: 1) Newer: - view/targetColumnV3 → truth column - view/predictionColumnsV2/manualFilter/manuallySelected → probability columns 2) Older: - view/targetColumn/selected → truth column - view/predictionColumns/selected_Internals → probability columns - (also checks view/predictionColumns/manualFilter/manuallySelected if present) |
31 | Row Aggregator | row_aggregator.py | Keys used (model): - categoryColumn (string | null) - aggregationMethod (COUNT | SUM | AVERAGE | MINIMUM | MAXIMUM) - frequencyColumns/selected_Internals + manualFilter/manuallySelected → aggregation column names - weightColumn (string | null; only SUM/AVERAGE use it) - grandTotals (boolean) |
32 | Row Filter | row_filter.py | Supported operators (heuristic mapping; see the operator sketch below the table): - IS_MISSING → df[col].isna() - IS_NOT_MISSING → df[col].notna() - EQ, EQUAL(S), = → numeric compare when possible; otherwise string compare - NE, NOT_EQUAL, <>, != → numeric compare when possible; otherwise string compare - GT, GREATER, > → to_numeric(df[col]) > to_numeric(value) - GE, GREATER_EQUAL, >= → to_numeric(df[col]) >= to_numeric(value) - LT, LESS, < → to_numeric(df[col]) < to_numeric(value) - LE, LESS_EQUAL, <= → to_numeric(df[col]) <= to_numeric(value) - CONTAINS → df[col].astype('string').str.contains(value, case=True, na=False) - STARTS_WITH / ENDS_WITH → df[col].astype('string').str.startswith/endswith(value, na=False) |
33 | RProp MLP Learner | mlp_learner.py | - Mapping: classcol→target; hiddenlayer→#hidden layers; nrhiddenneurons→neurons per layer; maxiter→max_iter; ignoremv→drop rows with NA in X/y; useRandomSeed/randomSeed→random_state. - Topology: hidden_layer_sizes = [n_hidden_neurons] × n_hidden_layers. - Implementation detail: scikit-learn has no RProp; uses MLPClassifier (solver='adam') as an approximation. - Features: all numeric/bool columns except target. If ignoremv=False, upstream imputation may be required (sklearn MLP does not accept NaNs). |
34 | Rule Engine | rule_engine.py | - Supported rules: TRUE => "out"; $col$ <op> value => "out" with <, <=, >, >=, =, ==, !=; $col$ LIKE "pat" (uses * as wildcard; converted to a regex). A trailing TRUE acts as default. - Column output: append to a new column if configured; otherwise replace the specified column; falls back to "RuleResult" when no name is provided. - Literals: numeric strings are emitted as numbers; everything else is a quoted Python literal. - Limitations: no AND/OR chaining, no between/in lists, no regex beyond LIKE→wildcard, and no type coercion beyond basic string/number handling. |
35 | Scorer | scorer.py | - Columns: 'first' → truth column, 'second' → prediction column (default "Prediction (<truth>)"). ignore.missing.values=true drops NA before scoring; false keeps NA (sklearn metrics may fail). - Confusion matrix labels: union of values from truth and prediction in order of appearance. |
36 | SMOTE | smote.py | - Feature/target: uses all numeric/bool columns as features and the configured class/target. - Methods (see the sampling sketch below the table): • oversample_equal → sampling_strategy='auto' (minorities up to majority) • otherwise uses rate: (0,1] → target_n ≈ rate * majority_n; >1 → target_n ≈ rate * minority_n - kNN: k_neighbors is clamped to ≤ (minority_count - 1) to avoid imblearn errors. - Fallbacks: if no target, no numeric features, single-class, or SMOTE raises, the original df is returned unchanged. |
37 | Statistics (Extended) | statistics.py | - compute_median: bool → include Median in numeric stats - filter_nominal_columns/included_names: list → which columns to treat as nominal - num_nominal-values_output: int → cap of categories per nominal column for Port 3 (occurrence table) |
38 | String Manipulation (Multi Column) | string_mamipulatioin_mc.py | - Append vs Replace: APPEND_OR_REPLACE ∈ {"APPEND_COLUMNS","REPLACE_COLUMNS"} * Append uses APPEND_COLUMN_SUFFIX (default "_transformed") - Missing handling: values are processed with pandas 'string' dtype to preserve NA - Abort flag ("Abort execution on evaluation errors"): when False, per-column exceptions are swallowed; when True, exceptions raise and stop execution. |
39 | String to Number | string_to_number.py | - Column selection: taken from model/include/included_names (present columns only). - Separators: supports a custom decimal separator and an optional thousands separator (see the conversion sketch below the table). - Target type: inferred from parse_type/cell_class (DoubleCell→Float64, Int/Long→Int64). - Error handling: if fail_on_error==True → raise on any parse issue; otherwise coerce to NA. - Missing values: preserved (pandas NA) via pd.to_numeric(..., errors='coerce') when not failing. |
40 | SVM Learner | svm_learner.py | - Feature coefficients (coef_) exist only for the linear kernel; for non-linear kernels we emit an empty coefficient table. - Scaling is not applied here; if KNIME’s node performs internal scaling, replicate it upstream. - Random seed: SVC uses it for probability calibration; defaults to 1 for reproducibility. |
41 | SVM Predictor | svm_predictor.py | - Bundle keys (if present): {'estimator','features','target','classes',...}; falls back to a bare estimator and infers features if absent (raises KeyError if required columns are missing). - Prediction column name: custom if "change prediction" is true and a name is provided; otherwise "Prediction (<target>)". - Probabilities: when predict_proba exists, adds "P (<target>=<class>)<suffix>" columns. |
42 | Table View | table_view.py | This view node intentionally writes NO outputs to the workflow context; it only prints. |
43 | Value Lookup | value_lookup.py | Merge details (see the merge sketch below the table): - To avoid dtype mismatches we always cast join keys to pandas 'string' dtype. - If caseSensitive is False we compare lowercased string keys. - We avoid name collisions by suffixing new columns with "_lkp" when needed. |
44 | X-Aggregator | x_aggregator.py | On intermediate folds: no outputs are published (is_complete=False). |
45 | X-Validation Partitioner (Loop Start) | x_partitioner.py | Dependencies & helpers • Uses scikit-learn splitters: KFold, StratifiedKFold, LeaveOneOut. • Relies on lxml for settings parsing and project helpers (first, first_el, normalize_in_ports, collect_module_imports, split_out_imports, iter_entries). |
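
The sketches below are minimal illustrations of selected rows; file names, column names, helper names, and settings values in them are assumptions, not values taken from the actual modules. This first one sketches the CSV Reader's read-then-coerce handling (row 6).

```python
import pandas as pd

# Hypothetical dtype mapping that would normally be derived from
# table_spec_config_Internals in settings.xml.
target_dtypes = {"age": "Int64", "score": "Float64", "name": "string"}

# Read WITHOUT dtype=... and treat '' / ' ' as missing.
df = pd.read_csv(
    "input.csv",                      # illustrative path
    na_values=["", " "],
    keep_default_na=True,
    skipinitialspace=True,
)

# Coerce per column: numeric targets via to_numeric(errors='coerce'),
# everything else via a plain astype.
for col, dtype in target_dtypes.items():
    if col not in df.columns:
        continue
    if dtype in ("Int64", "Float64"):
        # Int64 targets assume the coerced values are whole numbers.
        df[col] = pd.to_numeric(df[col], errors="coerce").astype(dtype)
    else:
        df[col] = df[col].astype(dtype)
```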
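
A sketch of the gbt_learner.py hyperparameter mapping and seed coercion (row 13), assuming the KNIME settings have already been parsed into a plain dict; the dict contents here are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative values as they might be parsed from settings.xml.
knime = {
    "nrModels": 100,
    "learningRate": 0.1,
    "maxLevels": -1,                  # -1/absent → sklearn default of 3
    "minNodeSize": 2,
    "minChildSize": 1,
    "dataFraction": 1.0,
    "seed": 1634651287143108282,      # KNIME seeds can exceed 2**32 - 1
}

seed = knime.get("seed")
# Coerce to a value sklearn accepts (0 <= seed < 2**32).
seed32 = None if seed is None else int(abs(int(seed)) % (2**32))

max_levels = knime.get("maxLevels", -1)
est = GradientBoostingClassifier(
    n_estimators=knime["nrModels"],
    learning_rate=knime["learningRate"],
    max_depth=3 if max_levels in (-1, None) else max_levels,
    min_samples_split=max(2, knime["minNodeSize"]),
    min_samples_leaf=max(1, knime["minChildSize"]),
    subsample=knime["dataFraction"],  # <1.0 enables stochastic GB
    random_state=seed32,
)
```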
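
A sketch of the shared predictor pattern used by rows 14, 18, 28, and 41 (prediction column plus per-class probability columns); the helper name and bundle layout follow the notes above but are otherwise assumptions.

```python
import pandas as pd

def apply_predictor(bundle, data: pd.DataFrame, suffix: str = "") -> pd.DataFrame:
    """Score `data` with a bundle like {'estimator','features','target','classes',...}."""
    est = bundle["estimator"]
    features = bundle.get("features") or list(getattr(est, "feature_names_in_", []))
    target = bundle.get("target", "target")

    out = data.copy()
    X = out[features]                 # KeyError if required columns are missing
    pred_col = f"Prediction ({target})"
    out[pred_col] = est.predict(X)

    if hasattr(est, "predict_proba"):
        proba = est.predict_proba(X)
        for i, cls in enumerate(est.classes_):
            out[f"P ({target}={cls}){suffix}"] = proba[:, i]
        # Confidence = max class probability per row.
        out[f"{pred_col} (confidence)"] = proba.max(axis=1)
    return out
```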
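
A sketch of the Normalizer's two modes (row 24); the function name and defaults are assumptions.

```python
import pandas as pd

def normalize(df: pd.DataFrame, cols, mode="MINMAX", new_min=0.0, new_max=1.0):
    out = df.copy()
    for col in cols:
        s = pd.to_numeric(out[col], errors="coerce")
        if mode == "MINMAX":
            rng = s.max() - s.min()
            if pd.isna(rng) or rng == 0:
                # Constant/empty column maps to new_min.
                out[col] = new_min
            else:
                out[col] = (s - s.min()) / rng * (new_max - new_min) + new_min
        elif mode == "ZSCORE":
            std = s.std()
            # Zero/undefined std → 0.0, otherwise (x - mean) / std.
            out[col] = 0.0 if (pd.isna(std) or std == 0) else (s - s.mean()) / std
    return out
```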
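
A sketch of the One to Many expansion (row 25) using pandas get_dummies with '=' as the prefix separator; the example frame is illustrative.

```python
import pandas as pd

def one_to_many(df: pd.DataFrame, cols, remove_sources: bool = False) -> pd.DataFrame:
    # "Region" → "Region=West", "Region=East", ...; rows with NA get all-zero dummies.
    dummies = pd.get_dummies(df[cols], prefix=cols, prefix_sep="=", dtype=int)
    out = pd.concat([df, dummies], axis=1)
    if remove_sources:
        out = out.drop(columns=cols)
    return out

df = pd.DataFrame({"Region": ["West", "East", None], "Sales": [1, 2, 3]})
print(one_to_many(df, ["Region"], remove_sources=True))
```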
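
A sketch of the Partitioning node's stratified split with a non-stratified fallback (row 26); the fraction, seed, and missing-class handling shown are illustrative.

```python
from sklearn.model_selection import train_test_split

def partition(df, fraction=0.7, class_column=None, seed=1):
    stratify = None
    if class_column is not None:
        # NaN is treated as its own class so those rows are not dropped.
        stratify = df[class_column].astype("string").fillna("<missing>")
    try:
        train, test = train_test_split(
            df, train_size=fraction, stratify=stratify, random_state=seed
        )
    except ValueError:
        # e.g. a class with a single member → fall back to a plain split.
        train, test = train_test_split(df, train_size=fraction, random_state=seed)
    return train, test
```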
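
A sketch of the Row Filter operator mapping (row 32), covering a subset of the operators listed in the table; the helper name is an assumption.

```python
import pandas as pd

def row_filter_mask(df: pd.DataFrame, col: str, operator: str, value=None) -> pd.Series:
    op = operator.upper()
    if op == "IS_MISSING":
        return df[col].isna()
    if op == "IS_NOT_MISSING":
        return df[col].notna()
    if op in ("GT", "GREATER", ">"):
        return pd.to_numeric(df[col], errors="coerce") > pd.to_numeric(value)
    if op in ("LE", "LESS_EQUAL", "<="):
        return pd.to_numeric(df[col], errors="coerce") <= pd.to_numeric(value)
    if op == "CONTAINS":
        return df[col].astype("string").str.contains(str(value), case=True, na=False)
    if op in ("EQ", "EQUALS", "="):
        try:
            # Numeric compare when possible, string compare otherwise.
            return pd.to_numeric(df[col], errors="coerce") == float(value)
        except (TypeError, ValueError):
            return df[col].astype("string") == str(value)
    raise ValueError(f"Unsupported operator: {operator}")
```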
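
A sketch of the SMOTE node's sampling-strategy and k_neighbors handling (row 36), assuming imbalanced-learn is installed; restricting the rate handling to a single minority class is a simplification.

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

def run_smote(X, y, method="oversample_equal", rate=1.0, seed=1):
    counts = Counter(y)
    minority_n = min(counts.values())
    majority_n = max(counts.values())

    if method == "oversample_equal":
        strategy = "auto"             # all minorities up to the majority size
    else:
        # rate in (0,1] scales the majority count, rate > 1 scales the minority count.
        target_n = rate * (majority_n if rate <= 1.0 else minority_n)
        minority_cls = min(counts, key=counts.get)
        strategy = {minority_cls: max(int(target_n), counts[minority_cls])}

    # k_neighbors must stay below the minority class size.
    k = max(1, min(5, minority_n - 1))
    sm = SMOTE(sampling_strategy=strategy, k_neighbors=k, random_state=seed)
    return sm.fit_resample(X, y)
```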
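
A sketch of the String to Number conversion (row 39); the separators, column list, and example values are illustrative.

```python
import pandas as pd

def string_to_number(df, cols, decimal=",", thousands=".",
                     target="Float64", fail_on_error=False):
    out = df.copy()
    for col in cols:
        s = out[col].astype("string")
        if thousands:
            s = s.str.replace(thousands, "", regex=False)
        if decimal and decimal != ".":
            s = s.str.replace(decimal, ".", regex=False)
        errors = "raise" if fail_on_error else "coerce"
        out[col] = pd.to_numeric(s, errors=errors).astype(target)
    return out

df = pd.DataFrame({"price": ["1.234,50", "99,90", None]})
print(string_to_number(df, ["price"]))  # 1234.5, 99.9, <NA>
```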
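
A sketch of the Value Lookup merge (row 43); the key and column names are assumptions.

```python
import pandas as pd

def value_lookup(data, dictionary, data_key, dict_key, case_sensitive=True):
    left = data.copy()
    right = dictionary.copy()

    # Cast join keys to pandas 'string' dtype to avoid dtype mismatches.
    left["_key"] = left[data_key].astype("string")
    right["_key"] = right[dict_key].astype("string")
    if not case_sensitive:
        left["_key"] = left["_key"].str.lower()
        right["_key"] = right["_key"].str.lower()

    merged = left.merge(
        right.drop(columns=[dict_key]),
        on="_key",
        how="left",
        suffixes=("", "_lkp"),        # avoid collisions on appended columns
    )
    return merged.drop(columns=["_key"])
```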